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Abatraet:  Several  preconditioned  conjugate  gradient  (PCG)-based  domain  decompoaition  tech¬ 
niques  for  self-adjoint  elliptic  partial  differential  equations  in  two  dimensions  are  compared  against 
ea^  other  and  against  conventional  PCG  iterative  techniques  in  serial  and  parallel  contexts.  We- 
consider  preconditioners  that  make  use  of  fast  Poisson  solvers  on  the  subdomain  interiors.  Sev¬ 
eral  preconditioners  for  the  interfacial  equations  are  tested  on  a  set  of  model  problems  involving 
two  or  four  subdomains,  which  are  prototypes  of  the  stripwise  and  boxwise  decompositions  of  a 
two-dimensional  region.  Selected  methods  have  been  implemented  on  the  Intel  Hypercube  by  as¬ 
signing  one  processor  to  each  subdomain,  making  use  of  up  to  64  processors.  The  choice  of  a  '"best’^ 
method  for  a  given  problem  depends  in  general  upon:  (a)  the  domain  geometry,  (b)  the  variability 
of  the  operator,  and  (c)  machine  characteristics  such  as  the  number  of  processors  available  and 
their  interconnection  scheme,  the  memory  available  per  processor,  and  communication  and  compu¬ 
tation  rates.  Wa  empfanize  the  importance  of  the  third  category,  which  has  not  been  as  extensively 
explored  as  the  firstftwo  in  the  domain  decomposition  literature  to  date. 
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1.  Introduction 


A  number  of  methods  based  on  domain  decomposition  have  been  proposed  in  recent  years  for 
the  numerical  solution  of  elliptic  partial  differential  equations.  Such  methods  are  based  upon  the 
observation  that  the  domain  of  problem  definition  may  be  regarded  as  the  union  of  two  or  more 
subdomains,  on  each  of  which  the  restriction  of  the  original  problem  may  take  on  a  particularly 
convenient  form.  Decomposition  by  domain  also  provides  a  natural  route  to  parallelism.  For 
some  problems,  these  methods  can  be  interesting  even  as  serial  algorithms,  when  the  advantages 
that  arise  from  isolating  the  subproblems  can  be  made  to  outweigh  the  extra  work  involved  in 
enforcing  the  proper  conditions  at  the  interfaces  of  the  subdomains.  A  fortiori,  they  are  interesting 
as  parallel  methods  since  reasonably  large  independent  subtasks  can  readily  be  identified.  We 
compare  the  performance  of  several  domain  decomposition  methods  and  a  class  of  undecomposed 
domain  methods  on  a  common  set  of  two-dimensional,  linear,  scalar  problems,  and  examine  the 
parallelizability  of  each.  Our  aims  are  to  point  out  the  unity  under  certain  conditions  of  methods 
which  have  been  presented  independently,  and  also  to  identify  some  problem  characteristics  which 
tend  to  favor  certain  methods  over  others  in  the  context  of  parallelism. 

Apart  from  the  advantage  of  reformulating  a  Ijirge  discrete  problem  as  a  collection  of  smaller 
problems  which  can  be  solved  independently,  there  are  at  least  two  other  motivations  for  considering 
domain  decomposition  methods,  which  may  be  present  individually  or  in  combination  with  the 
first  in  any  given  problem.  All  are  of  the  “divide-and-conquer”  type.  Domains  of  irregular  shape 
can  be  decomposed  into  subdomains  of  regular  shape  on  which  tensor-product-based  discretization 
schemes  can  be  employed,  leading  to  discrete  operators  of  regular  structure.  Also,  regions  of  relative 
nonuniformity  of  the  differential  operator,  whether  due  to  coefiicient  variability  or  even  substantially 
different  physics,  can  be  isolated  into  different  subdomains,  again  resulting  in  exploitable  locally 
regular  structure.  An  example  of  each  of  these  motivations  is  given. 

The  decomposition  topologies  considered  involve  both  simple  interfaces  (with  and  without 
overlap  regions)  and  cross-points.  We  compare  Schur  complement  matrix  methods  {e.g.,  [1,  13, 
17]),  full  partitioned  matrix  methods  based  on  block  Gaussian  elimination  not  making  explicit 
use  of  the  reduced  Schur  complement  system  {e.g.,  [4]),  and  other  methods  of  variational  type 
{e.g.,  [16]).  These  classes  of  methods  are  similar  at  the  discrete  level,  employing  preconditioned 
conjugate  gradient  (PCG)  iterations  as  the  outer  loop,  and  an  exact  equivalence  between  the 
iterates  of  the  first  two  can  be  established  under  certain  conditions.  In  all  of  these  methods 
the  largest  implicit  problems  are  Dirichlet  or  Neumann  solves  over  the  subdomains;  therefore 
an  easy  handle  on  parallelism  can  provided  from  an  a  priori  dissection  of  the  grid.  However, 
global  communication  is  required  in  forming  the  inner  products  of  the  PCG  iterations  (and  also  in 
one  class  of  preconditioning  methods),  so  the  optimum  parallel  implementation  is  not  completely 
straightforward.  The  optimum  number  of  subdomains  is  generally  both  architecture-  and  problem- 
dependent,  since  the  communication  cost  per  iteration  and  the  overall  number  of  iterations  tend  to 
increase  with  the  number  of  subdomains.  We  consider  the  standard  communication  topologies  of 
a  ring,  a  two-dimensional  mesh,  and  an  n-cube.  Onto  these  we  consider  the  natural  decomposition 
mappings:  a  decomposition  into  strips  onto  the  ring,  a  decomposition  into  boxes  onto  the  mesh, 
and  decompositions  of  both  types  onto  the  n-cube.  Serial  complexity  comparisons  and  parallel 
efficiency  comparisons  of  the  methods  are  presented. 

A  variety  of  preconditioners  have  been  proposed,  for  some  of  which  there  exist  theoretical 
results  showing  the  convergence  rate  of  the  iterations  to  be  asymptotically  independent  of  the 
spatial  resolution,  or  only  weakly  dependent  thereon.  Some  of  these  optimal  preconditioners  are 
exact  for  uniform  operators,  and  in  practice  work  best  when  the  operator  coefficients  do  not  vary 
too  much  along  the  interfaces.  For  problems  in  which  there  is  variation  along  the  interfaces,  the 
condition  number  of  the  system,  though  asymptotically  independent  of  resolution,  is  larger  and 
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it  is  interesting  to  consider  alternative  low-bandwidth  approximations  to  the  Schur  complement 
matrix,  at  least  for  sufficiently  low  resolution.  We  designate  this  type  of  preconditioning  Modified 
Schur  Complement  (MSC),  and  include  some  examples  in  our  comparisons. 

The  outline  of  this  paper  is  as  follows.  In  §2  we  review  some  recent  contributions  to  the 
domain  decomposition  literature,  summarizing  methods  and  key  theoretical  results,  and  introducing 
MSC  preconditioning.  Section  3  contains  an  experimental  comparison  of  various  methods  on  the 
same  small-scale  model  problems,  all  implemented  serially.  Parallel  implementation  issues  for 
domain  decomposition  methods  are  discussed  in  §4,  followed  by  some  preliminary  experimental 
results  generated  on  the  Intel  Hypercube.  A  theoretical  model  for  the  efficiency  of  various  parallel 
implementations  of  domain  decomposition  algorithms  on  some  model  large-scale  architectures  is 
presented  in  §5,  and  we  draw  some  conclusions  in  §6. 

2.  A  review  of  recent  PCG-based  domain  decomposition  methods 

The  substructuring  of  elliptic  partial  differential  equations  by  domain  has  served  as  a  practical 
computational  technique  for  over  twenty  years  (e.g.,  [23]),  and  its  theoretical  origins  (as  a  method  of 
proving  the  solvability  of  the  Dirichlet  problem  on  irregular  regions)  extend  back  to  the  last  century 
[28].  A  briefly  annotated  bibliography  of  various  direct  and  iterative  approaches  is  contained  in  the 
introduction  of  [l].  Here  we  summarize  only  recent  domain  decomposition  algorithms  which  make 
use  of  preconditioned  conjugate  gradient  iteration  in  the  outer  loop,  which  appear  to  have  originated 
with  [9].  There  are  other  types  of  domain  decomposition-related  methods  such  as  Schwarz- Jacobi 
[26]  and  Schwarz-multigrid  [21]  which  are  not  considered  in  this  paper. 

2.1.  Problem  definition 

The  methods  will  be  illustrated  on  a  model  second-order,  positive  definite,  self-adjoint  elliptic 
Dirichlet  problem  on  a  bounded  domain  in  with  a  piecewise  smooth  boundary; 

Lu  =  /  in  n, 
u  =  0  on  r  =  dCl, 

in  the  weak  formulation,  in  which  we  seek  u  €  ffo(^)  ^  ^  ^o(^) 

Aa(u,v)  =  (/,v), 

where 

2 

,  ,  f  ,  .  du  dv  ^ 

and 

(/,v)  =  [  fv  dx. 

Jn 

Given  a  triangulation  of  0,  we  define  the  discrete  subspace  of  Hq  consisting  of  piecewise  linear 
functions  vanishing  on  F,  .  Note  that  the  dimension  of  is  the  number  of  vertices  in  the 
interior  of  fl,  defined  as  n  .  For  all  of  the  algorithms  to  be  described  but  one,  it  is  sufficient  to 
consider  the  usual  cardinal  basis  for  consisting  of  piecewise  linear  functions  of  the  smallest 

possible  support,  denoted  =  {1,2, _ n}.  The  discrete  approximation  to  v  in  is 

then  represented  by  a  vector  of  nodal  coefficients,  Uh  ■ 

The  Galerkin  formulation  of  (2.1b)  leads  to  the  matrix  equation 


(2.16) 


(2.1o) 


where 


Figure  1:  Sample  domains  illustrating  the  triangulation  and 
substructuring  described  in  §2.1.  (a)  A  general  domain  show¬ 
ing  partitioning  into  two  subdomains  with  a  common  inter¬ 
face  (or  separator  set)  lying  along  segments  of  the  triangula¬ 
tion.  Support  of  typical  nodal  basis  functions  €  A’'i  and 
V>2y,i  €  A'o  are  shaded,  (b)  A  model  geometry  for  the  use  of 
fast  Poisson  solvers  on  the  subdomains:  a  union  of  uniformly 
triangulated  rectangles. 


IM.  =  f 

Jn 


f  in  dx,  i  e  N 


The  simplest  decomposition  on  which  all  of  the  methods  can  be  compared  is  that  involving 
two  simply  connected  subdomains.  Though  somewhat  academic  from  the  point  of  view  of  parallel 
processing,  we  consider  this  case  first  because  it  generalizes  straightforwardly  to  multiple-strip 
decompositions  and  allows  presentation  of  the  basic  ideas  without  cumbersome  notation.  For  all 
of  the  algorithms  to  be  described  but  one,  the  intersections  of  the  subdomains  are  restricted  to  be 
interfaces  of  lower  dimension.  Refering  to  Figure  la,  we  define  712  =  Sfli  n  8^2  ■ 

The  triangulation  must  be  such  that  the  segments  of  712  coincide  with  sides  of  the  triangular 
elements.  Let  /*  denote  the  restriction  of  /  to  fl*  .  With  a  slight  abuse  of  the  usual  notation,  we 
define  the  discrete  subspaces  of  functions  over  each  subdomain  which  vanish  on  the  outer  boundary, 
but  may  be  nonvanishing  on  the  interface  boundary,  H\f^^  {k  =  1,2),  and  their  subsets,  /fofc/i’ 
functions  which  vanish  also  on  712  .  The  have  cardinal  bases  {i^kj}jeNk  -  where  A’*  is  an  index 
set  for  the  nodes  interior  to  Qkt  of  dimension  rik  .  Denote  by  {i>kj}j€No'  where  A"^o  is  an  index  set 
of  dimension  no  for  the  nodes  on  the  interface,  the  bases  for  the  functions  in  Hlf,  which  vanish  at 
the  interior  nodes  of  H*,  but  have  nonvanishing  trace  on  712  . 
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Equation  (2.2)  can  be  symmetrically  permuted  into  the  form 


Aoo  Aoi  f  ^o  \  _  f  Jo\ 

■A-io  An  i  \'u\  J  \fi  J 


(2.3) 


where  the  first  block  row  corresponds  to  the  unknowns  defined  at  the  nodes  of  the  separator  set 
^12,  and  where  An  is  itself  a  2  x  2  block  diagonal  matrix,  one  block  corresponding  to  the  nodal 
values  in  each  subdomain.  In  (2.3)  and  hereafter  the  subscript  h  is  dropped  where  there  is  no 
ambiguity  between  the  continuous  and  discrete  formulations.  The  matrix  A  is  symmetric  and 
positive  definite,  as  are  its  diagonal  blocks,  by  the  hypotheses  preceding  (2.1).  Consequently,  the 
interior  unknowns  may  be  formally  eliminated  in  forming  the  Schur  complement  system  (sometimes 
denoted  the  capacitance  system) 


C«o  =  p, 


(2.4) 


where  C  =  Aoo  -  AqiA^/Aio  and  g  =  Jo  -  AoiA^jVi  • 

As  the  Schur  complement  of  Aoo  in  A,  C  is  also  symmetric  and  positive  definite.  Construction 
of  the  right-hand  side  of  (2.4)  requires  one  solve  on  each  subdomain  with  homogeneous  Dirichlet 
conditions  on  the  interface.  Having  solved  (2.4)  for  uo,  the  interior  unknowns  can  be  recovered 
from  the  lower  block  row  of  (2.3),  again  at  the  cost  of  one  solve  on  each  subdomain  with  uq  as 
inhomogeneous  Dirichlet  data. 

In  the  two-subdomain  case  it  is  helpful  for  illustrative  purposes  to  rewrite  (2.3)  in  the  expanded 

form 


Aoo 

•^10 

Ai;’ 

0 

^10 

0 

•^11  J 

(2.5) 


where  and  are  the  unknowns  in  subdomains  1  and  2,  respectively,  and  to  decompose  the 
interface  diagonal  block  as 


Aoo  =  Affo^  +  A^\ 


where 


:'^00 


V'  f 


dx. 


Uj  e  A'o- 


The  dimensions  of  a[\*  and  Aj^'  are  ni  and  n2,  respectively,  with  n  =  no  +  nj  -|-  n2. 

The  case  of  Poisson’s  equation  on  a  union  of  rectangles,  uniformly  triangulated  as  in  Figure  lb, 
will  be  frequently  considered  in  the  sequel.  For  this  so-called  Courant  triangulation,  fast  Poisson 

(k) 

solvers  [29]  may  be  used  to  invert  the  A)j  . 

Since  it  explicitly  involves  the  inverses  of  the  aJj\  the  matrix  operator  C  can  be  expensive 
to  construct,  generally  requiring  no  solves  on  each  subdomain.  Historically,  domain  decomposition 
was  first  approached  in  this  way  [23] ,  and  it  remains  a  useful  procedure  when  the  formation  of  the 
factors  of  C  can  be  amortized  over  a  large  number  of  right-hand  sides,  for  instance  when  the  same 
discrete  linear  system  arises  at  successive  steps  in  a  time-dependent  problem  {e.g..  [22]). 

For  cases  in  which  the  construction  of  C  cannot  be  so  amortized,  for  instance  when  a  new 
system  like  (2.2)  arises  at  each  step  in  a  nonlinear  problem,  more  efficient  methods  have  been 
developed  by  means  of  preconditioned  conjugate  gradient  iteration,  which  at  each  step  require  a 
matrix-vector  multiply  involving  C,  without  its  explicit  construction.  Each  PCG  iteration  still 
requires  one  solve  on  each  subdomain,  so  the  effectiveness  of  the  PCG-based  techniques  depends  on 
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keeping  the  number  of  steps  small  through  a  preconditioning  which  is  considerably  less  expensive 
to  construct  ajid  apply  than  C~^  itself. 

Of  course,  PCG  iterations  can  also  be  used  to  solve  the  full  systems  in  the  form  of  (2.2) 
and  (2.3),  the  type  of  preconditioning  which  is  natural  and  efficient  being  different  in  each  case.  In 
subsequent  sections  we  make  some  comparisons  the  effectiveness  and  generality  of  techniques  which 
take  each  of  these  three  formulations  as  their  starting  points.  We  refer  to  methods  which  depart 
directly  from  (2.2)  as  global  matrix  methods  (GMM),  from  (2.3)  as  partitioned  matrix  methods 
(PMM),  and  from  (2.4)  as  Schur  complement  matrix  methods  (SCM).  For  subsequent  reference  to 
individual  steps,  we  present  below  the  form  of  the  PCG  algorithm  used  in  applications  for  solving 
the  symmetric  n  x  n  linear  system  Ax=  b  with  (symmetric)  preconditioner  B: 


Algorithm  PCG 

Choose  initial  iterate: 

a:®,  arbitrary 

(PCG.l) 

Compute  initial  residual: 

—  6  -  Ax^ 

[PCG.2) 

Compute  preconditioned  residual: 

iPCG.3) 

Initialize  direction: 

(PCGA) 

Compute  B-inner-product: 

(PCG.5) 

For  k  —  0  Step  1  Until  Convergence,  Do 

Compute  matrix-vector  product: 

9*  - 

(PCG.6) 

Compute  >I-inner-product: 

(PCG.7) 

Compute  step  length: 

a*  7*/r* 

(PCG.S) 

Update  solution: 

X  - —  X  +  a  p 

(PCG.9} 

Compute  new  residual: 

r*"*'*  ■>—  r*  -  a*?'* 

(PCG. 10) 

Compute  new  preconditioned  residual: 

(PCG. 11) 

Compute  S -inner-product: 

(PCG.  12) 

Page  7 


Compute  orthogonalization  coefficient: 


{PCG.13) 


Update  direction: 


p*+J  ^  +  0i=pi‘ 


(PCG.U) 


End  For 


In  addition  to  the  storage  and  workspace  requirements  for  the  application  of  A  and  the 
algorithm  requires  storage  for  four  vectors  of  length  n.  As  is  well  known  [e.g.,  [9]),  the  >!-norm  of 
the  error  at  the  kth  iteration  of  PCO  is  bounded  according  to 


-X  11/ 


y/^-  1 
v/k+  1 


where  k  is  the  condition  number  of  B~^A  ,  and  the  algorithm  produces  exact  convergence  in  a 
number  of  iterations  which  is  at  most  the  number  of  distinct  eigenvalues  of  B~^A.  The  operators 
A  and  B  are  said  to  be  spectrally  equivalent  if  there  exist  positive  constants  70  and  71  independent 
of  the  discretization  such  that  for  all  i,  'jq{x,Bx)  <  {x,Ax)  <  7i(i,Sx). 

2.2.  Schur  Complement  Methods  (SCM) 

The  Schur  complement  methods  are  realized  by  taking  A  in  Algorithm  PCG  to  be  C  and 
selecting  appropriate  B.  The  iterations  occur  on  vectors  of  length  no-  Note  that  by  construction 
the  product  Cp*  consists  of  the  residuals  at  the  nodes  along  712  of  the  original  discrete  operator 
applied  to  the  vector  which  satisfies  the  discrete  equations  with  homogeneous  boundary  conditions 
on  r  in  each  of  the  subdomains  and  equals  p*  on  712  .  Four  related  choices  of  B  are  reviewed  in 
this  subsection. 

Dryja  [13,  14]  showed  that  C  of  (2.4)  is  spectrally  equivalent  to  the  matrix  where  K 

is  the  tridiagonal  matrix  of  order  no  with  diagonal  elements  2  and  off-diagonal  elements  -1,  the 
discrete  Laplacian  operator  over  a  uniform  grid  in  one  dimension.  For  notational  convenience,  we 
define  the  average  nodal  spacing  along  the  interface,  h  =  (no  +  1)“^-  Inserting  a  factor  of  2  for 
consistent  scaling  across  the  methods  to  follow,  Dryja’s  preconditioner  has  the  eigendecomposition 

Md  = 


where 


=  y/^sinijTrh, 


\d  -  diag{Xf), 

where 

Af  =  2^, 


(Tj  =  4s\n^—. 

The  condition  number  of  which  is  equal  to  A„o/Ai,  grows  in  proportion  to  the  number  of 

interfacial  unknowns  no  as  this  number  becomes  large.  Through  their  spectral  equivalence,  the  same 
is  true  of  C.  which  implies  that  the  convergence  of  the  unpreconditioned  Schur  complement  sj'stem 


. 
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iteration  will  deteriorate  as  the  mesh  is  refined.  The  action  of  on  a  vector  can  be  computed 
with  a  pair  of  sine  transforms  in  O{nologno)  operations,  hence  the  cost  of  the  preconditioning  is 
inexpensive  in  comparison  with  the  cost  of  forming  a  matrix  multiplication  with  a  general  C.  which 
involves  two-dimensional  subdomain  solves. 

By  means  of  FFT’s  in  one-dimension,  Dryja  also  gave  an  O(7iologno)  method  for  performing 
the  subdomain  solves  for  the  case  L  =  —A  and  H  a  union  of  two  rectangular  subregions,  by 
exploiting  the  regular  sparsity  pattern  of  the  right-hand  sides  arising  during  the  PCG  iterations. 
In  this  case  the  overall  operation  count  for  solving  (2.2)  is  dominated  by  the  pre-  and  post-processing 
involved  in  forming  g  of  (2.4)  and  backsolving,  instead  of  by  solving  the  Schur  complement  system. 
For  more  general  operators  and  domain  geometries,  the  cost  of  one  PCG  iteration  is  comparable 
to  the  pre-  and  post-processing. 

Golub  and  Mayers  [17]  arrived  heuristically  at  a  preconditioner  for  C  by  starting  from  the 
observation  that  the  elements  [C\ij  may  often  be  well  approximated  as  constant  along  a  diagonal. 
In  other  words,  the  influence  of  the  data  at  interfacial  node  j  on  the  residual  at  interfacial  node  i 
depends  (approximately)  only  on  the  discrete  distance  along  the  interface.  [/  -  jj  .  This  led  Golub 
and  Mayers  to  solve  a  discrete  infinite  domain  Laplace  problem  with  an  infinite  interface  dividing 
two  half-planes,  the  data  on  the  interface  and  at  infinity  prescribed  to  be  zero  everywhere  except 
at  the  origin,  where  it  was  taken  as  one.  The  ijth  element  of  their  preconditioner  of  Toeplitz 
form  was  then  defined  as  the  residual  of  the  discrete  Laplacian  at  the  point  on  the  interface  at  a 
distance  j?  -  j\  from  the  origin.  This  preconditioner  was  tested  and  found  superior  to  Dryja's  on 
the  problem  considered  in  [17|.  Though  computationally  complicated  to  construct,  they  noticed 
that  a  second.  FFT-implementable  preconditioner  could  be  derived  by  replacing  a  term  in  their 
generating  function  expression  for  the  interface  residual  by  Dryja’s  preconditioner.  Their  result. 

Mg  = 

where 

where 


or 

Mg  =  2(71  + 

has  been  also  employed  as  an  effective  refinement  of  Md  in  other  investigations  [1,  11], 

Bjorstad  and  Widlund  [l]  showed  that  C,  -  Aoj^(-4jj^)“^a[o\  and  - 

Aoi'(Aii^)“'Aio\  are  all  spectrally  equivalent.  (Note  that  C  =  Assuming  that  f2  is 

decomposed  in  such  a  way  that  it  is  computationally  convenient  to  solve  Neumann  problems  on  one 
of  the  subdomains,  say  flj,  with  zero  Dirichlet  data  on  dili  n  F  and  natural  boundary  conditions 
on  ^12  ,  they  proposed  as  a  preconditioner  for  C,  acting  on  a  suggestion  of  Dryja’s.  Applying 
(Cl^))“*  to  a  vector  po  of  dimension  no  requires  solving  the  Dj  Neumann  problem 


whence  *Po  •  In  the  case  of  Poisson's  equation  on  the  union  of  two  rectangular  uniformly 

gridded  regions,  as  pictured  in  Figure  lb.  an  explicit  FFT-implementable  expression  for  was 


Ag  =  diag(Xf), 


Page  9 


Figure  2:  The  applicability  of  the  Chan  preconditioner,  (a) 

A  model  geometry  for  which  Me  =  C.  (b)  A  model  geometry 
for  which  Me  is  not  exact  but  furnishes  a  useful  aspect  ratio- 
sensitive  preconditioner. 

derived  in  [l].  We  shall  define  Mb  as  twice  in  what  follows  for  purposes  of  comparison.  Let 
mi  be  the  number  of  internal  grid  points  in  the  vertical  direction  in  flj  .  Then 

Mb  =  WKbW^, 

where 
where 


where 
and 

r;±  =  l+y± 

This  is  a  diagonal  rescaling  of  Me  which  takes  account  of  the  aspect  ratio,  mi /no.  of  one  of 
the  subdomains.  In  the  case  where  f2  is  a  rectangle  and  f2i  and  ^2  are  symmetrically  disposed 
about  the  interface,  Mb  =  C.  and  the  Bjorstad-Widlund  method  converges  in  one 

step. 

Chan  has  carried  this  circle  of  ideas  further  for  the  case  where  fi  is  a  rectangle  and  L  is  the 
Laplacian  by  finding  the  Fourier  decomposition  of  C  itself  [6].  Referring  to  Figure  2a.  let  m2  be 
the  number  of  internal  grid  points  in  the  vertical  direction  in  VL-i. 

Then  in  the  previous  notation. 

Me  = 


As  =  diag{Xf), 


Pj  — 


V  •*4'“^**  M  '  it  ^  ^  *  "  >*  '  ^ 
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where 

Ac  =  diagiXf), 

where 

j  1  -  p"‘>+i  ^  1  -  j  V  '  ^  4  ■ 

By  taking  account  of  the  aspect  ratio  of  both  of  the  subdomains.  Chan's  method  converges 
in  one  step  even  in  the  unsymmetric  case,  and  thus  may  be  regarded  as  a  direct  fast  Poisson 
solver.  Of  course,  fast  Poisson  solvers  not  making  use  of  domain  decomposition  can  already  be 
applied  when  0  is  a  rectangle,  and  the  serial  computational  complexity  of  Chan's  approach  can  be 
made  nearly  equal  to  that  of  these  algorithms  by  solving  the  Schur  complement  system  in  Fourier 
transform  space  [24],  However.  Chan's  preconditioner  may  also  be  used  to  advantage  in  a  more 
general  geometry  like  that  of  Figure  2b.  for  which  a  fast  Poisson  solver  is  not  available,  although 
exact  only  for  the  inscri’oed  no  x  (mi  +  m2)  rectangle. 

In  aid  of  visualizing  these  preconditioners,  surface  plots  of  their  elements  as  a  function  of  their 
indices  are  given  in  Figure  3  below  for  the  case  no  =  15,  mi  =  m2  =  7. 

Corresponding  estimates  of  the  condition  numbers  k{M~^C).  where  C  derives  from  the  Lapla- 
cian  on  the  unit  square,  with  16  subintervals  on  a  side  are  listed  along  with  convergence  data  in 
the  second  row  of  Table  1  in  §3. 

The  Chan  and  Bjorstad-Widlund  methods  are  both  exact  for  the  Laplacian  in  a  symmetrically 
decomposed  rectangular  domain.  For  a  given  A.  the  Golub-Mayers  method  coincides  with  these 
two  methods  in  the  infinite  aspect  ratio  limit  of  the  domain  geometry: 


The  Dryja  preconditioner  is  not  exact  for  any  domain  geometry  limit.  All  four  of  the  preconditioners 
discussed  above  are  optimal  for  the  two-subdomain  case  in  the  sense  that  the  condition  number  of 
the  Schur  complement  system  approaches  a  constant  independent  of  as  the  mesh  is  refined  with 
fixed  geometry. 

A  disadvantage  of  working  with  the  Schur  complement  system,  independent  of  preconditioner, 
is  that  the  subdomain  solves  are  presumed  to  be  carried  out  exactly.  In  the  general  nonseparable 
case  of  interest,  fast  Poisson  solvers  are  not  available,  and  one  is  faced  with  the  necessity  of  direct 
sparse  solvers  in  the  computation  of  Cp^.  or  of  nested  iterations.  In  the  latter  case,  the  convergence 
criterion  for  the  inner  iterations  could  conceivabl)'  be  tuned  to  the  rate  of  progress  of  the  outer 
iteration  to  save  computational  work,  but  a  more  fully  coupled  framework  of  simultaneous  iteration 
is  possible,  which  we  review  in  the  next  subsection. 

2,3.  Partitioned  Matrix  Methods  (PMM) 

The  partitioned  matrix  methods  are  realized  by  taking  to  be  A  of  (2.3)  and  selecting  appro¬ 
priate  B.  The  iterations  occur  on  vectors  of  length  n,  rather  than  no-  We  begin  with  the  following 
theorem,  due  to  Eisenstat  [15]. 

Theorem  2.1.  Let  the  n  x  n  partitioned  matrix  A  and  the  no  x  no  Schur  complement  matrix  C  be 
defined  as  in  (2.3)  and  (2.4),  respectively,  with  corresponding  right-hand  sides  f  and  g. 

(i)  Algorithm  PCG  applied  to  Cv  =  g  with  initial  iterate  and  preconditioner  M  is  equivalent  to 
algorithm  PCG  applied  to  Au  =  f  with  initial  iterate 

(a-*(/,  -  .4,ot”)) 


Figure  3:  Surface  plots  of  the  elements  of  the  Schur  com 
ment  matrix  C  for  Poisson's  equation  on  a  square  unifoi 
discretized  with  16  subintervals  on  a  side,  and  various 
conditioners.  (al  C  itself,  fbl  Mn.  (cl  Mn,  (dl  Mn{=  A 


azid  preconditioner 


B  = 


M  +  AoiAjj^  Aio 

Mo 


Aoi 

■^11 


in  the  sense  that,  for  all  h  >  0, 


,.k 


^11  ih  -  Mov'') 


(2.6) 


(ii)  There  is  no  advantage  to  choosing  an  initial  iterate  more  general  than  t/°  as  above,  in  the  sense 
that  ||i/*  ~  'u|U  <  IK'*  —  w|U  !  w'iere  u'*  is  the  kth  iterate  generated  by  PCG  from  the  initial  iterate 


w 


0 


Proof,  (i)  Since  An,  M,  and  B  are  symmetric  positive  definite  matrices,  we  may  write  M  =  Q^Q 
and  B  =  P^P,  where 


The  Schur  complement  system  PCG  iteration  is  equivalent  to  conjugate  gradient  (CG)  iteration  on 
Cv  =  g  with  initial  iterate  =  0,  where  C  =  Q~^CQ~^ ,  v  =  Q{v—v^°^)  and  g  =  Q~^ [g  —  Cv^^^)-. 
and  similarly  the  partitioned  svstem  PCG  iteration  is  equivalent  to  CG  iteration  on  Ail  =  /  with 
initial  iterate  ii(°)  =  0,  where  A  =  P“^AP“*,  u  —  P(u  —  and  /  =  P~^{f  —  Au^°^).  But  note 
that 


A  = 


nT  aT  4-1/2 
V  ^10^11 


A 


1/2 

11 


-1  r 


Aoo  Aoi 

Aio  All 


Tl 


1-1/2. 


1/2 


‘11  "^10  ^11 


Q-’^  -Q-'^aLA-^ 


10^11 
^11 


Aoo  -K)! 

Alo  AiiJ  [-An^AioQ 


1-1 


-1 


.-1/2 

1^11 


Q  ^(Aoo  -  AoiAj/Aio)  0 


^11 


^10 


.1/2 

111  J 


~M[iAioQ 


A-^/^ 


0 

-1 


Q-^CQ-^  o' 

c  o' 

0  I 

0  I 

Similarly, 


/  = 


nT  aT  4-1/2 
V  -^10^11 


Tl 


.1/2 

111 


^/o  -  ^oiAi/(/i  -  Aior°)  -  Aoov°^  _ 


t-T 


0 


—Q  ^AfoA^i* 
j-1/2 


•■11 


tn-{ 


Thus  the  partitioned  matrix  PCG  iteration  is  equivalent  to  CG  iteration  on 


C  0 
0  / 


I -{I 


with  initial  iterate  =  0.  =  0  and  the  equivalence  is  established. 
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(it)  To  prove  optimality,  we  first  define  =  P{w^  —  u®),  and  note  that 


=  ll«’*  -  tiii^ 

-min  (  -iI-CPk{C))v\  ' 

=  min  {||(/-  C'Pt(C'))r^||^  +  ||(1  -  Pt(l))A\Ce\\l} 
>xmn{\\{I-CP,{C)}v\\l} 

=  k*  -  *11^ 

where  P*  is  the  set  of  polynomials  of  order  k,  and  the  second  and  penultimate  steps  rely  on  the 
optimality  property  of  CG  iteration  (see  [8],  chapter  3,  for  instance). 

I 

By  this  theorem,  any  Schur  complement  system  preconditioner  M  can  be  applied  to  the  in¬ 
terfacial  equations  of  the  partitioned  matrix  by  its  incorporation  into  the  matrix  B  as  shown  in 
(2.6).  Moreover,  the  theorem  suggests  a  form  for  the  preconditioner  for  a  more  general  algorithm 
in  which  is  it  not  required  that  the  subdomain  solves  corresponding  to  be  carried  out  exactly. 
For  instance,  if  A  derives  from  a  nonseparable  operator  L,  fast  Poisson  solvers  may  be  used  to 
precondition  the  subdomain  solves.  More  costly  exact  subdomain  solves  are  thereby  avoided.  To 
allow  for  this  type  of  generality,  we  take  B  to  be  B,  where 


Poo 

Poi 

_  M  AoiA^j*Aio 

■^1 

.  PlO 

Pn. 

^10 

A\  \ . 

and  where  the  Aij  are  determined  on  a  problem-specific  basis.  A  convenient  form  for  the  A,y  may 
be  derived  from  a  permutation  conformal  to  that  of  (2.3)  of  the  discrete  Galerkin  equations  for 


where 


where  for  each  subdomain 


An(u,t;)  =  (/,v), 


An(«,v)  =  ^Af)*(«,v), 
t=i 

An,(u,r)=  ^  L^POdTdr 


and  where  the  constants  2^  are  chosen  for  each  subdomain  k  in  such  a  way  that  the  resulting 
matrix  A  is  spectrally  equivalent  to  A.  This  iteration  is,  of  course,  no  longer  equivalent  to  any 
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reduced  system  iteration  in  the  manner  described  above,  and  the  scaling  of  M  relative  to  the  A,y 
becomes  an  issue,  where  it  was  not  previously. 

An  efficient  implementation  of  partitioned  matrix  preconditioners  of  the  form  (2.7)  has  been 
presented  by  Bramble,  Pasciak  4:  Schatz  [4],  wherein  it  is  described  how  the  operation  B~^r  can 
be  carried  out  at  the  cost  of  two  subdomain  solves  per  subdomain  per  iteration,  plus  the  cost 
of  applying  M~^ .  In  the  language  of  function-space  decompositions,  the  solution  n  is  written  as 
the  sum  of  a  harmonic  component,  u^,  and  a  perpendicular  component,  u^,  such  that  in  each 
subdomain  fc,  u^\k  €  ifofc/i  satisfies 

=  Anju.u) 

for  all  V  €  ^okh'  ^  ^Ih  satisfies  =  v  on  'yij  and 


for  all  V  €  Bfok/i- 

In  matrix  terms,  to  solve 


for  u,  one  first  solves 


for  uf ,  then 


for  uoi  tlien 


Ant(v^,v}  =  0 


(")=(/:) 


Aii«f  =  /i 


AIuq  =  /o  ~ 


Aiittf  =  -Aiouo 

for  uf  and  sets  uj  =  -f  uf . 

It  is  tempting  to  consider  as  a  more  economical  alternative  for  B  the  matrix 

'  M  p  ' 

Aio  All  ’ 

whose  inversion  allows  skipping  the  first  of  the  three  steps  above  (and  thus  leads  to  an  algorithm 
with  no  more  subdomain  solves  per  iteration  than  SCM),  but  this  is  not  a  symmetric  preconditioner. 
The  penalty  for  preserving  a  nest-free  iterative  structure  in  the  case  where  the  subdomain  solves 
with  All  are  too  expensive  to  be  done  directly  is  thus  an  extra  solve  per  subdomain  with  An  in 
the  preconditioning  step,  and  also  the  extra  work  in  the  multiplication  and  dot  product  steps  due 
to  the  vector  length  of  n  instead  of  no- 

In  [5j  the  same  authors  consider  a  generalization  of  the  Bjorstad-Widlund  preconditioning  in 
which  the  M  of  (2.7)  is  given  by 


In  matrix  terms,  to  solve 


(:)=(/:) 
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for  u,  one  first  solves  the  Dirichlet  problem  in  02 : 

then  the  Neumann  problem  on  Oi: 


Since  one  of  the  subdomains  solves  is  eliminated,  the  serial  complexity  of  this  method  is  less  than 
that  of  the  function-space  decomposition  technique  just  described,  but  the  remaining  subdomain 
solves  are  inherently  sequential,  and  generalization  to  the  many-subdomain  case  can  only  be  cairried 
out  for  decompositions  (such  as  stripwise)  in  which  the  boundaries  of  the  subdomains  on  which  the 
natural  boundary  conditions  are  posed  do  not  intersect. 


2.4.  Modified  Schur  Complement  (MSC)  preconditioners 

Because  the  Schur  complement  matrix  C  is  close  to  being  tridiagonal,  as  illustrated  for  the 
Laplacian  operator  in  Figure  3,  it  is  natural  to  consider  the  use  of  tridiagonal  or  other  low-bandwidth 
preconditioners  for  it.  Such  an  approximation  of  one  matrix  by  another  constrained  to  satisfy 
various  sparsity  requirements  has  often  proved  useful  in  connection  with  iterative  methods.  We 
can  generate  a  class  of  interfacial  preconditioners  of  the  form 


^^S(k)  —  ^  (2.9a) 

where  Ek  is  a  symmetric  matrix  with  semi-bandwidth  k  which  satisfies 

EkVi  —  (^iAfj*.Aio)v,-  (2.96) 

for  some  k  +  I  vectors  v,-,  t  =  0,...,k.  We  have  used  A\\  rather  than  An  in  (2.9b)  because  this 
technique  is  motivated  by  variable  coefficient  problems  for  which  exact  subdomain  solvers  are  too 
expensive  to  consider. 

Note  that  the  construction  of  Ek  requires  k-{-  1  solves  on  each  subdomain,  which  is  in  general 
(k-l-l}/2  times  the  cost  of  one  preconditioning  step  with  B  of  the  form  (2.7).  For  ^  =  0,  we  consider 
the  vector  uq  =  (1?  A:  =  1,  the  vectors  vo  =  (1,0, 1,0, . .  .)^  and  vi  =  (0, 1,0, 1, . .  .)^;  and 

so  forth.  Equation  (2.9b)  gives  {k  +  l)no  scalar  equations  for  the  {k  +  l)(no  —  k/2)  distinct  elements 
of  Ek-  but  the  overdetermination  is  consistent  due  to  the  symmetry  of  the  product  of  matrices  on 
the  right-hand  side.  For  the  set  of  v,-  recommended  above,  the  diagonal  elements  of  Ek  can  be  read 
off  from  (2.9b)  and  the  remaining  elements  can  be  obtained  in  O{kno)  operations.  Note  that 
can  also  be  applied  at  the  cost  of  only  O(fcno)  operations  once  its  factorization  is  stored. 

Surface  plots  of  Ms(2)  for  same  model  problem  described  in  conjunction 

with  Figure  3  are  shown  in  Figure  4,  and  their  associated  estimated  condition  numbers  also  appear 
in  Table  1. 

For  small  k,  which  is  the  only  practical  limit,  this  class  of  preconditioners  turns  out  to  be 
markedly  inferior  to  the  optimal  class  described  earlier  for  constant  coefficient  operators  on  uniform 
grids.  However,  it  can  be  competitive  when  the  interfaces  are  not  placed  along  level  curves  of  the 
coefficients.  Figure  5  shows  surface  plots  of  the  first  three  members  of  the  Ms{k)  class  and  the  Schur 
complement  matrix  itself  for  a  problem  with  the  same  model  geometry,  but  with  the  nonseparable 


Figure  4:  Surface  plots  of  the  elements  of  the  Schur  comple¬ 
ment  matrix  C  for  Poisson's  equation  on  a  square  uniformly 
discretized  with  16  subintervals  on  a  side,  and  the  first  three 
MSC  preconditioners,  (a)  C  itself,  (b)  Ms(o)-  (c)  Ms{i)>  (d) 


operator  equation  V  (aVu)  =  /,  where  a  =  H-6tan“^(a--  5)  +  ctaD“'(d(y-  5)),  and  with  all  of  the 
tilde-quantities  in  (2.9)  obtained  from  the  subdomain  averaging  of  the  x-  and  y-direction  diffusion 
coeflRcients  in  the  opposite  direction  to  achieve  separability  within  subdomains.  Estimated  condition 
numbers  for  this  problem,  using  PMM  with  interfacial  preconditioners  from  both  the  optimal  and 
MSC  classes  appear  in  Table  7  of  §3.2.  Observe  the  sensitivity  of  the  MSC  preconditioners  to  the 
variation  of  a  along  the  interface,  which  is  lacking  in  the  optimal  preconditioners.  The  plots  are 
for  b  =  0.65,  c  =  0.35,  d  =  10.0. 

The  MSC  preconditioners  have  the  advantage  of  being  self-scaling,  relative  to  the  and 
can  enjoy  a  decisive  advantage  over  a  poorly  scaled  optimal  interface  preconditioner  in  the  PMM 
context.  However,  it  is  not  difficult  to  choose  a  good  scading  for  the  optimal  preconditioners  (see 

[4]). 

2.5.  Methods  of  Variational  Type 

Glowinski  and  several  coauthors  have  contributed  extensively  to  the  literature  of  domain  de¬ 
composition  techniques  in  the  course  of  their  work  on  the  numerical  modeling  of  steady,  compress¬ 
ible  inviscid  flow  and  of  unsteady,  incompressible  viscous  flow  (see  [16]  and  the  references  therein). 
Both  of  these  subrealms  of  the  Navier-Stokes  equations  may  be  operator-split  and  discretized  in 
such  a  way  that  the  subtasks  of  greatest  computational  complexity  are  scalar  problems  of  Poisson 
type  with  a  large  number  of  degrees  of  freedom  and  wide  variability  in  the  coefficients  in  irregular 
geometry.  We  briefly  describe  here  three  algorithms  of  conjugate  gradient  type  presented  in  [10. 
16]  which  enter  into  the  comparisons  in  §3.  Like  the  Schur  complement  methods,  all  of  these  meth¬ 
ods  iterate  on  degrees  of  freedom  at  the  boundaries  of  the  subdomains  only,  with  each  iteration 
requiring  one  or  two  exact  interior  solves  per  subdomain.  Two  of  the  methods  use  a  partitioning  of 
the  domain  into  subdomains  with  non-intersecting  interiors,  as  in  Figures  1  and  2.  The  remaining 
method  is  related  to  the  Schwartz  alternating  procedure  in  that  the  subdomains  overlap  in  regions 
of  nonzero  measure. 

The  algorithms  are  summarized  below  in  matrix  operator  form,  without  derivation,  for  the 
model  Poisson  problem  (2.1b).  It  suffices  to  identify  the  vectors  in  Algorithm  PCG  in  terms  of  the 
physical  variables  and  to  specify  the  operators  A  and 

The  first  two  algorithms  are  based  on  the  equivalence  of 

-Ay  =  /  in  n. 
y  =  0  on  r, 

and  the  problem 

y  =  arg  min  {-  /  |  [^  dx - 

z€H^{n)  2  Jq 

Upon  partitioning  the  domain  and  carrying  out  the  minimization  on  each  of  the  subdomains  sepa¬ 
rately,  either  the  normal  derivative  or  the  trace  of  the  solution  along  the  interface  can  be  regarded 
as  the  principal  unknown,  and  its  value  iteratively  updated  until  continuity  of  the  other  is  satisfied. 
Through  intermediate  saddle-point  formulations  dual  conjugate  gradient  algorithms  we  derived. 

In  the  first  algorithm  (§4.1  of  [lO]  or  §2.2.2  of  [16]),  the  unknown  vector  represents  the  discrete 
normad  derivative  of  the  solution  along  the  interface  adjusted  for  sign,  x  =  (-l)*“^5u*/9nt,  A  is 
the  matrix  (C^^))"^  -+-  (C^^))“^,  and  B  is  the  matrix  defined  by  (for  k  =  1  or  k  =  2) 

[B],;  =  h~^  /  ds,  ij  E  No- 

-'KI2 

In  the  uniform  mesh  case.  B  =  I  —  \K.  This  method  we  denote  by  Saddle- 1. 


Figure  5:  Surface  plots  of  the  elements  of  the  Schur  com¬ 
plement  matrix  C  for  a  nonseparable  operator  on  a  square 
uniformly  discretized  with  16  subintervals  on  a  side,  and  the 
first  three  MSC  preconditioners.  (a)  C  itself,  (b)  A/s(o).  (c) 
A/5(i).  (d)  Ms[2)- 


■wv.W"'. 


Figure  6:  Sample  domain  illustrating  the  overlap  decomposi¬ 
tion  (subdomains  staggered  for  clarity). 


In  the  second  algorithm  (§4.2  of  [10]),  denoted  Saddle-2,  the  unknown  vector  represents  the 
solution  itself  at  the  interface  nodes,  A  is  precisely  the  Schur  complement  C,  and  B  is  the  matrix 
defined  by 

2  p 

lB]ij  =  ^  /  •  VV'fc,’  dx,  ij  €  No- 

k^l  “'0* 


In  the  uniform  mesh  case  B  —  These  preconditioners  commute  with  their  respective 

^-matrices,  the  spectra  of  which  can  be  constructed  from  the  eigendecompositions  given  in  §§2.2. 
It  is  readily  verified  that  in  both  cases  k(B~^A)  grows  asymptotically  in  proportion  to  l/k. 

The  other  method  (§2.3.5  and  §2.3.7  of  [16])  uses  a  decomposition  with  overlapping  subdo¬ 
mains,  as  illustrated  for  a  two  subdomain  case  in  Figure  6.  The  overlap  region  is  denoted  by  Qu 
and  distinct  internal  subdomain  boundaries  71  and  72  are  defined  as  shown.  We  denote  this  method 
by  CG-Schwarz  (or  CGS). 

The  unknown  vector  represents  the  trace  of  the  solution  along  71  U  72.  and  is  partitioned 
accordingly,  u  =  (ui,  U2)^  .  For  the  case  of  the  Laplacian,  the  algorithm  is  based  on  the  equivalence 
of 

—Ay  =  /  in  n, 


y  =  0  on  F, 


and  the  problem 


u  =  (ui,U2)  =  arg  min  {i  /  |V(y2  -  yi)|^ -I- |y2  -  yi[’^  dx}, 

veVixVj  2  ./n  L 


where  the  y/t  €  are  the  solutions  of 

-Ay*  =  /*  in  fl*, 
y*  =  0  on  dfi/t  D  F, 
y*  =  v*  on  7t, 


and  where  V*  is  the  space  consisting  of  the  traces  on  7*  of  functions  in  H^(Qic)  which  vanish  on 
dQi2  n  r,  for  k  =  1,2. 

We  introduce  the  discrete  subspace  of  H^(Cli2)  consisting  of  piecewise  linear  functions 

vanishing  on  5ni2nr.  but  not  necessarily  on  71  or  72,  and  we  denote  a  basis  for  it  by  {xj},j  €  A’12. 
Let  A’oi  and  A'02  be  index  sets  for  the  nodes  along  71  and  72,  respectively. 

The  action  of  on  a  vector  x  =  (a:i,X2)^  is  implicitly  defined  as  follows.  First,  the  pair  of 
subdomain  Dirichlet  problems  with  boundary  data  a:*,  on  the  7*., 


■A-tyk  =  -CkXk, 


is  solved  for  the  j/*,,  where 


[^k]i}  =  [  ■  ^^kki  dx,  i,j  e  Nk 

[Cikliy  =  /  VV'ij  ■  VV’fci  dx,  i  €  Nk,j  €  Nok- 


Then  the  difference  of  the  y*  over  the  region  of  common  definition, 

^Vk  =  yitln,2  -  yi\ni2^ 

is  computed,  where  I  is  the  complement  of  k  in  {1,2}.  Then  another  pair  of  subdomain  Dirichlet 
problems  with  forcing  due  to  the  difference  in  the  y^  over  the  overlap  region, 


AkZk  =  Fk6yk, 


is  solved  for  the  2*.  where 


where 


=  f  [^X7  ■  +  Xj^ki]  dx,  i  €  NkJ  €  A^i2. 

Finally, 

\D26y2-E2Z2)' 

where 

[■Dfclo  =  /  [Vx;  •  V0*,-  +  Xji’ki]  dx,  i  €  NokJ  €  A’i2 
J012 

and 

[Ek]ij  =  [  ^i’kj  ■  '^'*i’ki  dx,  I  e  Aot,;  €  Nk- 
Jn* 

We  may  consider  different  preconditioners,  which  are  of  the  form  B  =  diag{Bi,  82)-  In  §2.3.5 
of  [16],  an  essentially  unpreconditioned  form  of  the  algorithm  is  proposed  in  which  S*  has  the 
interfacial  line  integral  form  of  B  in  the  Saddle- 1  method.  In  the  experiments  in  §3  we  in  fact 
report  on  B*  =  N  In  §2.3.7  the  preconditioner 

{Bk]ij  =  f  '^tkkj  •  ^^ki  dx,  ij  e  A"o* 
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is  used.  Though  not  mentioned  in  [16].  we  also  consider  S*  =  A' in  the  numerical  experiments. 
Note  that,  like  the  PMM,  the  overlap  decomposition- based  method  requires  two  Dirichlet  problems 
to  be  solved  in  each  subdomain  per  iteration. 

2.6.  Generalization  to  Multiple  Nonintersecting  Interfaces 

The  two-subdomain  case  may  possess  genuine  interest  for  many  problems  of  physical  origin, 
but  it  is  of  limited  interest  when  implementation  of  domain  decomposition  algorithms  on  ensemble 
architectures  with  a  large  number  of  processors  is  considered.  The  multiple  subdomain  case  has 
been  studied  by  Drj'ja  and  Proskurowski  in  [ll]  and  [12],  in  which  serial  implementations  making 
use  of  up  to  32  subdomains  have  been  tested,  and  by  Chan  and  Resasco  in  [7].  The  multiple  strip 
case  can  be  accommodated  by  our  notation  of  (2.3)  and  (2.4)  by  letting  uq  represent  all  of  the 
separator  set  unknowns,  ordered  by  interface,  and  letting  uj  represent  all  of  the  interior  unknowns, 
ordered  by  subdomain.  If  G  is  uniformly  triangulated,  and  if  there  are  p  subdomains  of  equal  size 
with  no  interior  gridpoints  along  each  interface  and  m  gridpoints  across,  then  Aoo  is  a  block  matrix 
with  p  -  1  diagonal  blocks  of  size  no,  and  An  is  a  block  matrix  with  p  diagonal  blocks  of  size  nom. 
Under  our  assumptions  on  A,  the  Schur  complement  (which  is  block  tridiagonal  with  blocks  of  size 
no)  is  symmetric  and  positive  definite,  and  the  following  is  proved  in  [ll]: 

Theorem  2.2.  Let  the  Schur  complement  C  be  defined  as  following  (2.4)  and  above,  and  let  Mp  be 
defined  as  the  (p  —  1)  x  (p  —  1)  block  diagonal  matrix  with  no  x  no  diagonal  blocks  Mp-  Then,  for 
all  X, 

w')o{x,Mpx)  <  (x,Cx)  <  w~^')i(x,Mpx)  ,  (2.10) 

where  70  and  71  are  positive  constants  independent  of  h,  and  w  is  the  minimum  of  the  strip  widths. 

This  makes  Afp  an  optimal  preconditioner  for  C  for  a  given  decomposition,  ajid  similar  block- 
diagonal  extensions  can  be  made  for  the  other  single-interface  preconditioners.  The  analogous 
block  preconditioner  Mp  is  also  tested  in  [ll].  The  reference  [12]  considers  a  multiple  subdomain 
generalization  of  the  alternating  Neumann-Dirichlet  method,  first  presented  for  the  two-strip  case 
in  [2].  Finally,  [7]  presents  an  exact  eigendecomposition  of  the  multiple-strip  Schur  complement 
matrix  which  generalizes  Chan’s  method  [6]  to  a  multiple-strip  domain-decomposed  fast  Poisson 
solver. 

Note  from  (2.10)  that  the  block  preconditioning  of  C  by  Mp  suffers  as  the  strips  become  thin, 
that  is  as  the  number  of  subdomains  is  increased  for  fixed  domain  geometry.  In  [7]  this  is  traced 
to  ignoring  the  off-diagonal  blocks  of  C.  By  means  of  the  exact  eigendecomposition,  these  blocks 
may  be  shown  to  be  insignificant  in  the  limit  of  large  aspect  ratio  subdomains  (“thick”  strips), 
but  more  and  more  significant  in  the  limit  of  small  aspect  ratio  subdomains.  Since  w  a  p”*, 
the  bound  on  the  condition  number  from  (2.10),  7i/(7ow^).  increases  like  p*.  The  inability  of 
stripwise  decompositions  to  accommodate  large  numbers  of  subdomains  without  a  deterioration  of 
convergence  is  a  major  weakness  when  it  comes  to  large-scale  parallel  implementations.  A  more 
implicit  means  of  handling  the  interfaces  is  required. 


2.7.  Generalization  to  Intersecting  Interfaces 

An  extension  to  decompositions  in  which  interfaces  intersect  in  vertices  or  crosspoints,  resulting 
in  the  formation  of  boxes  instead  of  strips,  is  given  by  Bramble  et  al  in  [4];  this  development  is 
in  fact  the  focus  of  their  paper.  (The  single  interface  version  of  the  PMM  presented  in  §§2.3  is  a 
special  case.)  Constraints  of  space  and  focus  prohibit  the  full  development  of  their  technique  here, 
but  a  brief  account  of  its  implications  for  the  discrete  version  of  the  problem  is  furnished  below. 

The  key  step  is  the  further  decomposition  in  function  space  of  the  discrete  harmonic  component 
of  u,  .  Recall  from  §§2.3  that  u  =  ,  where  in  each  subdomain  k,  i/^j*  vanishes  on  dQ^ 

and  satisfies 
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=  An*(u.7') 

for  all  i'  €  -ffoi/,;  and  satisfies  =  u  on  and 

=  0 

for  all  V  E  -ffojtA-  +  u'  ,  where  in  each  subdomain  /.*.  u'  |/^  and  u^|*.  each 

satisfy 

^n*(w'  <t’)  =  An*(u^.r)  =  0 

for  all  t'  6  -ffofc/i;  linear  along  the  edges  of  dClt  and  and  agrees  with  v  at  the  vertices,  and 

vanishes  at  the  vertices.  A  finite  element  basis  for  u  is  then  constructed  which  consists  of  the 
usual  C°  piecewise  linear  basis  functions  of  smallest  possible  support  defined  at  the  nodes  interior 
to  each  Qk  and  at  the  nodes  along  each  edge  (except  at  the  intersections  of  the  edges),  and  a  special 
set  of  basis  functions  defined  at  the  crosspoints  with  support  extending  out  along  the  edges  which 
connect  the  crosspoints.  These  special  basis  functions  each  vanish  at  all  crosspoints  but  one  and  at 
all  interior  nodes,  and  are  linear  along  the  edges.  The  Galerkin  equations  involving  the  crosspoint 
degrees  of  freedom  thus  possess  a  nonlocal  connectedness.  If  the  nodal  values  of  u  are  permuted  as 
before  so  that  uo  represents  the  separator  nodes  (edges  and  crosspoints),  and  uj  the  interior  nodes, 
then  the  computation  of  uf  and  proceed  as  before,  as  independent  Dirichlet  solves  over  each 
subdomain.  However,  the  calculation  of  uo  from 

Muq  =  fo-  Aoiuf 

proceeds  in  another  series  of  independent  solves  as  follows.  M  is  a  block  diagonal  matrix  with 
a  diagonal  block  corresponding  to  the  nodes  along  each  edge.  These  blocks  are  each  essentially 
matrices  of  the  form  Mp.  leaving  scaling  considerations  aside.  The  final  block  corresponds  to 
a  diagonally  dominant  difference  equation  for  the  internal  crosspoint  nodes  and  has  a  sparsity 
structure  identical  to  a  graph  of  the  decomposition:  each  edge  connecting  one  vertex  i  to  another 
vertex  j  contributes  a  nonzero  to  the  tth  row  in  column  j.  For  a  decomposition  of  the  unit  square 
into  uniform  square  subdomains,  this  matrix  is  simply  the  discrete  Laplacian,  to  within  a  scaling 
factor.  The -right-hand  side  involves  inner  products  between  the  interfacial  nodal  coefficient  vectors. 
The  reader  is  referred  to  [4]  for  the  complete  details.  We  conclude  by  paraphrasing  Theorem  1 
and  Remark  2.6  therein  as  follows,  in  which  it  is  assumed  that  the  underlying  triangulation  and 
decomposition  into  subdomains  are  quasi-uniform  of  sizes  h  and  d.  respectively. 

Theorem  2.3.  Let  the  matrix  A  be  deSned  as  in  (2.3)  and  above,  and  let  B  be  defined  as  in  (2.7), 
where  M  is  the  separator  set  preconditioner  of  [4j.  Then,  for  all  x, 

7o{a:,Bx)  <  (i,  Ai)  <  7i(i,  Bi)  .  (2.11) 

where  -70  and  71  are  positive  constants  such  that  for  some  positive  constant  cj, 

J<ci(l  +  ln(^)2). 

70  h 

If,  instead,  the  vertex  difference  equation  block  of  M  is  replaced  by  a  weighted  identity  operator 
then  for  some  positive  constant  cj. 

—  <  C2d~^(l  +  ln(^)*)  . 
lo  n 


The  implications  of  this  theorem  for  the  many  subdomain  decomposition  are  illustrated  by 
the  following  special  case.  Consider  a  square  domain  uniformly  discretized  with  n  subintervals 
on  a  side  and  uniformly  decomposed  into  p  square  subdomains  of  nj ^  subintervals  on  a  side. 
Then  d  oc  ^  and  djh  =  .  Therefore,  by  Theorem  2.3.  when  the  crosspoints  are  treated 

implicitly  the  condition  number  is  bounded  by  Ci(l  +  ln(n/.^)^):  and  when  they  are  decoupled  the 
condition  number  is  bounded  by  a  C2p(l  +  ln(n/.yp)2).  In  the  former  case  the  condition  number  is 
bounded  by  a  constant  if  subdomains  are  introduced  as  required  with  increasing  resolution  to  keep 
the  number  of  subintervals  on  a  subdomain  side  fixed.  In  the  latter  case,  the  condition  number 
grows  in  proportion  to  the  number  of  subdomains. 

A  simpler  alternative  for  handling  the  crosspoints  in  the  decoupled  case  is  to  employ  the  usual 
cardinal  finite  element  basis  for  all  of  the  interior  degrees  of  freedom,  including  the  crosspoints, 
along  with  a  weighted  identity  operator  for  the  corresponding  block  of  M.  This  method  requires 
no  special  computations  to  form  the  right-hand  side  of  the  crosspoint  system.  The  examples  of  §3 
labeled  “PMM  without  vertex  coupling’"  employ  this  simpler  alternative. 

3.  Experimental  Comparisons  of  Domain  Decomposition  Techniques 

The  methods  described  in  section  2  were  tested  on  a  variety  of  model  problems  chosen  to 
reveal  their  relative  strengths  and  weaknesses.  Fc^r  of  the  problems  we  consider  ao'e  posed  on  the 
unit  square,  symmetrically  divided  into  two  strips  or  four  boxes  by  straight  segments  bisecting 
opposite  sides.  The  operators  include  the  cases  of  uniform,  discontinuous,  and  smoothly  varying 
nonseparable  coefficients.  The  other  problems  are  posed  in  a  small  aspect  ratio  rectangle,  and  in  a 
T-shaped  region. 

These  tests  were  carried  out  on  a  \’AX/785  in  single  precision  (24-bit  mantissa)  using  the 
experimental  interpretive  language  CLAM  (Conversational  Linear  Algebra  Machine).  By  providing 
a  convenient  symbolic  interface  to  the  LINPACK  and  EISPACK  libraries,  CLAM  allowed  relatively 
quick  “breadboarding’'  of  the  various  algorithms.  The  tests  are  not  as  comprehensive  as  some  of 
those  available  in  the  literature  for  individual  problems  and  methods  (e.s.,[l,  4,  11.  12])  because  of 
size,  speed,  and  precision  limitations,  nor  do  they  embrace  as  wide  a  scope  of  domain  geometries  or 
operators.  The  chief  value  of  these  results  is  tutorial,  in  that  they  permit  algorithmic  comparison 
on  a  common  set  of  problems  from  common  initial  iterates  with  common  convergence  criteria  and 
measures,  all  of  which  may  vary  or  be  left  unstated  from  paper  to  paper. 

3.1.  Conventions  in  conducting  tests  and  reporting  results 

To  complete  the  specification  of  Algorithm  PCG  of  §§2.1.  which  is  the  basis  for  all  of  the 
experiments,  the  initial  iterate  is  always  taken  to  be  zero  everywhere,  and  convergence  is  based  on 
the  relative  size  of  the  unnormalized  Euclidean  norm  of  the  true  residual.  Since  these  practices 
are  not  uniform  in  the  testing  of  such  algorithms,  we  comment  briefly  upon  them.  In  practical 
applications  of  PCG,  the  most  conveniently  monitored  convergence  criterion  is  the  norm  of  the 
preconditioned  residual,  (r,  S~*r)2,  since  the  square  of  this  quantity  is  already  required  to  advance 
the  algorithm.  However,  when  comparisons  across  a  variety  of  preconditioners  B  are  carried  out, 
the  convergence  criterion  should  not  be  affected  by  the  type  of  preconditioner.  Therefore  the  norm 
of  the  unpreconditioned  residual  was  calculated  at  every  iteration  of  the  serial  tests,  as  well.  We 
declare  convergence  when  (r*,r*)^  <  £(r°,r°)2.  as  opposed  to  the  absolute  criterion  used  in  some 
of  the  early  domain  decomposition  literature,  in  order  to  remove  dependence  on  the  resolution  of 
the  problem.  In  our  test  problems  the  shape  of  the  right-hand  side  /  is  generally  such  that  its 
domain  average  decreases  by  an  appreciable  factor  as  the  mesh  is  refined  over  several  powers  of  two 
and  points  further  from  the  symmetrically  located  maximum  of  /  are  brought  in.  This  is  reflected 
in  a  reduced  initial  residual,  since  the  initial  iterate  is  zero,  and  fewer  iterations  are  required  to 
reach  an  absolute  residual  tolerance.  If  anything,  we  should  insist  that  an  iterative  method  work 
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harder  as  the  resolution  increases  in  order  to  realize  the  potential  of  the  lower  truncation  error  of 
the  discretization,  otherwise  the  additional  resolution  is  wasted  from  the  continuous  viewpoint.  As 
a  minimum  requirement,  we  insist  that  the  residual  be  decreased  by  a  constant  factor.  In  all  of  the 
tables  to  follow,  e  =  10"'*. 

The  principal  results  that  we  report  for  each  experiment  are  the  number  of  iterations  to  con¬ 
vergence.  1.  the  estimated  condition  number  of  the  preconditioned  system,  k.  and  the  average 
reduction  per  iteration  of  the  residual,  p.  These  are  complementary  indications  of  the  rate  of  con¬ 
vergence  of  the  method.  As  the  only  one  of  these  three  quantities  which  depends  soleh'  upon  the 
operator,  the  true  condition  number  is  an  attractive  measure.  However,  a  large  condition  number 
ma>'  lead  to  an  overly  pessimistic  appraisal  of  a  method  if  the  eigenvalues  are  unevenly  distributed. 
The  iteration  count  is  the  most  useful  “^bottom  line”  convergence  measure  for  use  in  conjunction 
with  cosT-per-iteration  complexity  estimates  to  determine  overall  algorithmic  complexity.  How¬ 
ever.  this  is  convergence  criterion-dependent.  The  average  rate  of  reduction  per  iteration,  defined 
by  [{r^ ,r^)/[r^ where  I  is  the  number  of  iterations  to  convergence,  is  a  more  reliable 
estimator  of  the  convergence  rate  than  the  condition  number,  without  being  as  strongly  dependent 
upon  the  particular  convergence  criterion  as  /,  and  it  does  not  exhibit  the  threshold  effect  that  I 
does.  However,  p  still  retains  a  dependence  on  the  initial  iterate.  Since  none  of  these  three  measures 
is  ideal,  we  report  all  of  them. 

The  condition  number  estimate  in  our  tables  is  obtained  as  a  by-product  of  the  PCG  iterations 
by  the  method  of  Lanczos  [20]  at  a  small  computational  overhead.  For  each  k  in  the  loop  {PCG.6)- 
(PCG.14).  the  diagonal  and  subdiagonal  elements  of  a  symmetric  tridiagonal  matrix  are  formed 
according  to 


Qfk-l 


Sk+i 


respectively.  The  extreme  eigenvalues  of  this  matrix  often  accurately  approximate  those  of  B~^A 
even  if  terminated  with  k  considerably  less  than  n. 

Finally,  we  note  that  although  the  finite  element  formulation  has  been  chosen  throughout  this 
paper  in  order  to  provide  a  uniform  theoretical  background  for  the  different  methods,  the  results  in 
Tables  5,  6,  and  7  were  actually  generated  using  the  standard  second-order  finite  difference  method 
to  construct  the  discrete  A  and  A.  (These  alternative  formulations  do  not  generally  coincide  at  the 
discrete  level  when  the  operator  is  not  piecewise  constant.) 


3.2.  Stripwise  decomposition  tests 

The  first  test  problem  is  Poisson's  equation 


=  /  , 

in  the  unit  square  divided  symmetrically  into  two  strips,  where  /  is  chosen  so  that  u  =  16xj/(l  — 
x)(l  -  y).  This  problem  is  essentially  the  same  as  the  first  example  in  [ll.  12).  It  may  be  solved 
conveniently  by  a  SCM.  Surface  plot  comparisons  of  the  preconditioners  in  Table  1  with  C  were 
given  in  Figure  3. 

For  this  domain  geometry,  and  for  all  but  one  to  follow,  the  Schur  complement  matrix  pre¬ 
conditioners  Me  and  Mb  are  identical,  so  there  is  no  need  for  separate  tabulation.  The  optimalit\’ 
properties  of  Me,  Me-  and  Mp  are  in  evidence,  with  the  number  of  iterations  independent  of  prob¬ 
lem  size.  As  the  mesh  is  refined,  p  actually  shows  a  slight  improvement  for  Me  and  Mq-  For  any 
meshes  but  the  coarsest,  the  MSC  preconditiouers  are  inferior,  and  exhibit  steadily  deteriorating 
p.  The  last  column,  for  which  B  =  I.  shows  the  proportionality  of  the  condition  number  to  the 
number  of  degrees  of  freedom  in  the  unpreconditioned  system. 
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1 

2 

!  K 

1.000 

1.094 

1  p 

l.l(-6) 

l.l(-3) 

2 

3 
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Table  1:  =  /  on  the  unit  square,  divided  symmetrically 

into  two  strips.  Results  for  SCM  as  a  function  of  problem 
size  and  interface  preconditioner. 
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5 

4 
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1.484 

1.3(.l) 
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3 

1.072 
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Table  2;  V  (aVu)  =  0  on  the  unit  square,  divided  symmet¬ 
rically  into  two  strips,  where  a  =  1.0  in  one  subdomain  and 
0.1  in  the  other.  Results  for  SCM  as  a  function  of  problem 
size  and  interface  preconditioner. 


The  second  problem  involves  a  physical  interface  which  coincides  with  the  subdomain  boundary. 
The  equation  is: 

V  (aVu)  =  0 

where  the  diffusion  coefficient  a  is  1.0  in  one  of  the  subdomains  and  0.1  in  the  other,  and  where 
u  obeys  the  Dirichlet  condition  u  =  xy  on  the  boundary  of  the  unit  square.  Again,  a  SCM  is 
appropriate  for  this  problem. 

The  optimal  methods  continue  to  perform  well  in  spite  of  the  jump  discontinuity,  although 
p  no  longer  shows  improvement  with  increasing  problem  size.  The  MSC  methods  fare  slightly 
better  than  in  the  featureless  case,  but  still  cannot  compete  with  the  optimal  methods  on  so  ideal 
a  problem. 


Table  3:  =  /  on  a  low  aspect  ratio  rectangle,  divided 

asymmetrically  into  two  strips.  Results  for  SCM  as  a  function 
of  problem  size  and  interface  preconditioner. 
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2.260 
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Table  4:  =  /  on  an  asymmetric  T-shaped  region,  divided 

into  two  strips.  Results  for  SCM  as  a  function  of  problem  size 
and  interface  preconditioner. 

The  third  problem  is  again  Poisson’s  equation,  a  compressed  version  of  the  first  problem  posed 
in  the  rectangle  (0,1)  x  (0,  |),  with  an  interface  at  y  =  5  and  /  chosen  so  that  u  =  - 

^)y(l  -  y)- 

The  subdomain  aspect  ratios  in  this  problem  are  more  typical  of  those  in  multistrip  decompo¬ 
sitions  of  domains  of  roughly  unit  aspect  ratio.  This  is  the  only  problem  to  exhibit  a  distinction 
between  Me,  which  is  again  exact,  and  Afg. 

As  our  only  example  of  a  geometry  which  is  unsuitable  for  an  undecomposed  fast  Poisson  solver, 
we  consider  the  asymmetric  T-shaped  domain  problem  from  [l].  The  domain  is  the  unit  square  with 
two  rectangles  removed,  as  shown  to  scale  in  Figure  lb.  The  equation  is  Poisson’s  equation  with 
inhomogeneous  Dirichlet  boundary  data  and  right-hand  side  chosen  so  that  u  =  +  —  xe*  cos  y. 

Though  the  averall  geometry  is  nonseparable,  the  subdomains  can  each  be  handled  by  a  fast  Poisson 
solver,  so  a  SCM  is  appropriate. 

The  optimal  methods  have  no  trouble  handling  geometrical  irregularity  such  as  this.  The  MSC 
methods  are  more  competitive  than  before  due  to  the  spoiling  of  symmetry  which  prevents  the  use 
of  any  exact  preconditioning,  but  they  continue  to  be  inferior  for  sufficiently  fine  resolution. 

The  fifth  problem  involves  generalization  in  another  direction:  a  nonseparable  operator.  We 
consider  ELLPACK  problem  #1  [25]: 

V  •  (aVu)  +  Kv  =  f  , 

where  an  =  e*'*',  022  =  an  =  021  =0,  Jv  =  -(1  -I-  x  -f  and  where  /  is  chosen  so  that 
V  =  |e**'sin(jrx)sin(7ry).  Contour  plots  of  am  022  and  K  appear  in  Figure  7. 

For  this  problem,  we  use  a  PMM  and  present  results  for  two  different  forms  of  the  parti¬ 
tioned  matrix  preconditioner  B  in  two  separate  tables.  The  condition  number  estimates  pertain 


Figure  7:  Contour  plots  of  the  coefficients  of  the  fifth  and 
sixth  test  problems,  with  nonseparable  operators,  (a)  an, 
(b)  022,  (C)  -K,  (d)  a. 
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2.2(-l) 

2.2(-l) 
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3.244 
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3.0(-l) 
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3.2(-l) 
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3.1(-1) 

3.0(-l) 
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3.917 

3.892 

4.041 

4.565 

3.992 

4.034 

3.917 

! 

p 

3.6(-l) 

3.6(-l) 

3.6(-l) 

4.0(-l) 

3.8(-l) 

3.7(-l) 

3.6(-l) 

Table  5:  V  (aVu)  +  ku  =  /  on  the  unit  square,  divided 
symmetrical!}’  into  two  strips,  where  an  =  e**',  022  =  e"**', 
ai2  =  021  =  S’lid  K  =  (1  +  3:  +  y)~^,  with  subdomain 
preconditioning  by  the  Laplacian.  Results  for  PMM  as  a 
function  of  problem  size  and  interface  preconditioner  and  for 
GMM  as  a  function  of  problem  size. 


to  k{B~^A),  where  B  is  constructed  according  to  (2.7),  by  using  the  different  M-blocks  shown  at 
the  head  of  the  table  columns.  Results  for  the  preconditioning  arising  from  taking  A  of  (2.7)  to 
be  the  Laplacian  operator  are  shown  in  Table  5.  This  is  a  rather  coarse-grained  preconditioner, 
which  does  not  take  advantage  of  the  sensitivity  of  the  domain  decomposition  technique  to  local 
operator  properties.  We  also  consider  a  more  conventional  use  of  the  PCG  technique,  the  GMM  of 
preconditioning  a  nonseparable  problem  by  one  which  can  be  implemented  by  a  fast  Poisson  solver 
over  the  entire  domain.  The  results,  displayed  in  the  last  column,  are  identical  to  those  of  Me, 
which  is,  of  course,  the  decomposed  equivalent  of  the  Laplacian. 

The  problem  is  a  difficult  one  for  all  of  the  methods,  as  evidenced  by  the  fact  that  for  the 
coarsest  mesh  case,  the  number  iterations  for  all  methods  is  fully  equal  to  the  number  of  unknowns 
on  the  interface.  (Of  course,  we  are  now  iterating  on  vectors  with  length  0{h~^),  but  recall  that 
for  the  first  four  problems  considered,  the  PMM  and  SCM  implementations  have  identical  iteration 
counts  much  lower  than  the  number  of  interfacial  unknowns.)  For  problems  of  this  difficulty,  the 
MSC  preconditioners  do  not  fare  appreciably  worse  than  the  optimal  ones,  ajid  achieve  comparable 
condition  numbers  even  at  the  moderate  resolutions  investigated. 

Table  6  shows  the  improvement  possible  by  generating  the  preconditioner  blocks  A*  by  subdo¬ 
main  averaging  the  original  operator  A,  giving  a  PMM  finer  granularity  than  a  GMM.  Let  (aii)y(i) 
and  {K)y{x)  denote  subdomain  averages  of  the  respective  coefficients  with  respect  to  y  and  (022)1(5/) 
and  {K)z{y)  the  same  with  respect  to  x.  In  each  subdomain  k.  A*  has  the  separable  form 


l^).- 


=  /  (( 
Jn, 


+  (a22)x^^  +  dx. 


With  the  local  averaging,  the  best  of  the  optimal  PMMs  begin  to  break  away  from  the  GMM 
even  though  only  two  subdomains  are  employed.  A  more  graphic  illustration  of  the  improvements 
offered  by  the  PMM  in  the  many  subdomain  case  is  given  in  example  3  of  [4]. 

Our  final  stripwise  decomposition  example  is 


V  .  (aVu)  =  / 

on  the  unit  square,  where  a  =  1  -H  6tan“*(x  —  5)  +  ctan"^(d(y  —  ^)).  A  contour  plot  of  a  with 
b  =  0.65,  c  —  0.35,  and  d  =  10.0  appears  in  Figure  7,  and  surface  plots  of  the  actual  Schur 
complement  matrix  and  various  MSC  approximations  thereto  were  given  in  Figure  5. 


t'age  29 


“FT- 

Me 

Mg 

Md 

H99 

GMM 

8 

a 

■H9 

■H9 

■99 

■99 

■99 

999 

mU 

H 

■wwi 

999 

99 

■B9 

16 

■■ 

5 

5 

6 

6 

6 

5 

5 

1.517 

1.515 

2.088 

1.898 

1.543 

1.661 

■9 

1.3(-1) 

1.3(-1) 

1.9(-1) 

2.0(-l) 

1.5(-1) 

1.3(-1) 

1.5(-1) 
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2.637 
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1.4(-1) 

1.4(-1) 

1.9(-1) 

2.9{-l) 

2.6(-l) 

2.0(-l) 

1.8(-1) 

Table  6:  V  (aVu)  +  kv  =  f  on  the  unit  square,  divided 
symmetrically  into  two  strips,  where  an  =  e**',  022  =  e"**', 
ai2  =  021  =  0^  and  K  =  (l+i+j/)"’,  with  subdomain  precon¬ 
ditioning  by  a  locally  averaged  separable  variable-coefficient 
operator.  Results  for  PMM  as  a  function  of  problem  size 
and  interface  preconditioner  and  for  GMM  as  a  function  of 
problem  size. 
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Table  7:  V  •  (aVu)  =  /  on  the  unit  square,  divided  sym¬ 
metrically  into  two  strips,  where  a  =  1  6tan~^(x  ~  5)  + 
ctan”*(d(y  -  5)),  with  subdomain  preconditioning  by  a  lo¬ 
cally  averaged  separable  variable-coefficient  operator.  Re¬ 
sults  for  PMM  as  a  function  of  problem  size  and  interface 
preconditioner  and  for  GMM  as  a  function  of  problem  size. 

A  separable  subdomain-averaged  PMM  preconditioner  was  constructed  in  the  manner  de¬ 
scribed  above.  In  this  example,  the  results  of  which  appear  in  Table  7,  all  of  the  PMMs  are  better 
than  or  comparable  to  the  GMM. 

3.3.  Boxwise  decomposition  tests 

We  carried  out  serial  tests  of  the  two  decomposition  strategies  involving  crosspoints  on  the 
first  problem  of  the  previous  subsection,  namely  Poisson's  equation  on  the  unit  squsje.  Table  8 
contains  the  results  for  the  uncoupled  crosspoint  version  of  the  algorithm,  and  Table  9  contains  the 
results  for  the  coupled  crosspoint  version.  Since  there  is  only  one  crosspoint  in  this  example,  the 
two  algorithms  differ  only  in  the  assembling  of  the  right-hand  side  of  the  crosspoint  equation,  as 
described  in  §§2.7. 

Even  though  there  is  only  a  single  crosspoint,  the  difference  between  employing  local  and 
nonlocal  basis  functions  for  the  crosspoint  degree  of  freedom  is  evident  in  comparing  the  tables. 
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Table  8:  =  /  on  the  unit  square,  divided  into  four  equal 

boxes.  Results  for  PMM  without  vertex  coupling,  as  a  func¬ 
tion  of  problem  size  and  interface  preconditioner. 
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Thble  9:  V*u  =  f  on  the  unit  square,  divided  into  four  equal 
boxes.  Results  for  PMM  with  vertex  coupling  after  Bramble 
et  al.,  as  a  function  of  problem  size  and  interface  precondi¬ 
tioner. 

The  condition  numbers  are  uniformly  smaller  in  the  nonlocal  case,  though  the  overall  number 
of  iterations  is  not  greatly  affected.  The  growth  of  condition  number  for  Dryja’s  interface  pre¬ 
conditioner  with  crosspoint  coupling  (the  third  column  of  Table  9)  follows  the  theoretical  bound 
ci(l  -t-  \a{nly/p)'^)  with  a  c\  of  approximately  1. 

A  word  is  in  order  concerning  the  construction  of  the  MSC  preconditioners  in  the  case  of 
multiple  interfaces.  We  require  that  M  be  block-diagonal  as  with  the  other  preconditioners,  to 
preserve  the  independence  of  the  interfacial  solves.  In  order  to  avoid  solving  four  Dirichlet  problems 
on  each  subdomain  for  each  degree  k  of  approximation  in  MSC(/c),  we  economize  by  varying  the 
components  of  the  vectors  v,  of  (2.9b)  on  all  interfaces  simultaneously,  rather  than  treating  the 
interfaces  one-by-one.  The  algorithm  described  in  §§2.4  is  then  applied  block-by-block  to  generate 
the  coefficients  of  the  interfacial  blocks  of  Ms^|cy  The  vectors  u,  always  have  zeros  corresponding 
to  the  crosspoint  location  and  the  crosspoint  block  is  unaffected  by  the  choice  of  interfacial  blocks. 
As  in  the  case  of  Poisson’s  equation  on  two  strips,  the  MSC  methods  cannot  compete  with  the 
optimal  methods. 

3.4.  Variational  methods  tests 

The  methods  of  §§2.5  were  applied  to  the  problem  of  Table  1.  A  two-strip  decomposition  was 
used  for  the  saddle-point  methods.  The  decomposition  pictured  in  Figure  6,  with  an  overlap  region 
which  spanned  one  subinterval  on  either  side  of  the  horizontal  bisector  of  the  square,  was  employed 
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Table  10:  V^u  =  /  on  the  unit  square,  divided  symmetri¬ 
cally  into  two  strips,  with  and  without  overlap.  Results  as  a 
function  of  problem  size  and  method/preconditioning  combi¬ 
nation. 

for  the  CG-Schwarz  methods.  (Thus,  the  width  of  the  overlap  region  was  allowed  to  shrink  wdth 
increasing  resolution.)  The  results  are  given  in  Table  10. 

We  note  that  the  preconditioning  is  effective  on  the  CG-Schwarz  method,  the  spectrum 
of  the  discrete  operator  A  of  which  has  not  been  analyzed  to  date.  However,  it  is  apparent  by 
comparison  with  Table  1  that  on  such  simple  problems  these  methods  are  inferior  to  SCM  with 
optimal  preconditioning. 


4.  Parallel  experiments  with  domain  decomposition  algorithms 

In  order  to  test  the  convergence  of  the  methods  on  a  larger  number  of  subdomains  and  to  test 
the  degree  of  parallelization  that  is  possible  in  practice,  two  PMM  methods,  with  and  without  vertex 
coupling  and  with  Md  as  the  interfacial  preconditioner,  were  programmed  for  the  Intel  Hypercube. 
For  simplicity  and  economy  of  coefficient  storage,  we  considered  only  Poisson’s  equation  on  the 
unit  square.  For  programming  convenience,  efficient  use  of  the  processors  and  economy  of  data 
transfer  between  nodes,  we  wrote  a  single  program  to  run  on  each  of  the  nodes,  rather  than  writing 
special  purpose  programs  to  do  the  major  identifiable  operations  in  Algorithm  PCG,  such  as  interior 
solves  or  edge  solves.  The  same  program  was  used  to  run  both  strips  and  boxes,  with  different 
compile-time  constants  to  reshape  the  arrays.  The  number  of  processors  was  left  dyna  uc. 

4.1.  Domaln-to-processor  mapping  and  dataflow  analysis 

The  square  domain  is  uniformly  discretized  with  n  subintervals  on  a  side  and  divided  into  p 
subdomains  in  one  of  two  ways.  In  the  case  of  domains  consisting  of  strips,  each  domain  is  an  nxn/p 
rectangle.  In  the  case  of  boxes,  each  domain  is  an  n/.yp  xn/^p  square.  For  simplicity  and  efficiency 
we  require  n  and  p  to  be  powers  of  2.  By  restricting  ourselves  to  this  limiting  case  of  decomposition 
regularity,  we  are  able  to  pass  over  the  complex  issue  of  load  balancing.  The  work  per  processor  is 
balanced  a  priori  and  does  not  change  during  the  course  of  the  solution  process  in  response  to  any 
adaptive  mechanism.  It  should  be  noted  that  regular  decompositions  of  this  type  are  the  exception 
in  applications.  The  tradeoffs  between  load  balancing  efficiency  and  communication  efficiency  in 
multiprocessor  implementations  of  adaptive  algorithms  for  partial  differential  equations  have  begun 
to  receive  attention  elsewhere  ([3,  18]).  We  note  however,  that  some  geometrical  adaptivity  can 
be  accommodated  within  the  context  of  logically  uniform  decompositions  by  means  of  generalized 
tensor-product  gridding  (see.  e.g..  the  examples  in  j4]). 

Four  types  of  nodes  are  to  be  distinguished  in  the  decomposition,  for  purposes  of  keeping  track 
of  the  data  flow:  interior  nodes,  separator  set  nodes,  nodes  which  are  adjacent  to  a  separator,  and 
nodes  which  are  adjacent  to  an  exterior  boundary.  For  the  strip  decomposition,  we  associate  the 
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gridpoints  along  each  interface  with  one  of  the  two  subdomains  it  separates  so  that  each  processor 
(except  for  one)  is  assigned  one  of  the  interfaces  of  length  n.  For  definiteness  in  what  follows,  let 
the  strips  run  north-south,  and  let  the  east  face  of  each  subdomain  contain  the  separator  nodes, 
the  west  face  containing  nodes  which  are  adjacent  to  a  separator,  with  the  obvious  exceptions  of 
the  west-most  and  east-most  subdomains.  For  the  box  decomposition,  we  associate  the  gridpoints 
along  each  interface  with  one  of  the  two  subdomains  it  separates  and  each  vertex  with  one  of  the 
four  subdomains  it  separates  so  that  each  processor  (except  for  those  along  two  of  the  domain 
boundaries)  is  assigned  two  interfaces  of  length  and  one  crosspoint.  For  definiteness,  let  the 
east  face  and  south  faces  of  each  subdomain  contain  the  separator  nodes,  let  the  southeast  corner 
contain  the  crosspoint,  and  let  the  west  and  north  faces  contain  nodes  which  are  adjacent  to  a 
separator,  again  with  the  obvious  boundary  exceptions.  We  map  p  strips  onto  the  hypercube  by 
labeling  the  strips  from  one  end  of  the  domain  to  the  other  in  binary  reflected  Gray  code  order 
[I9j.  We  map  p  boxes  onto  the  hypercube  by  f<  rming  the  tensor-product  of  two  such  rings,  along 
the  row  and  column  directions  of  the  subdomain  “grid” .  In  this  way  all  of  the  local  exchanges  in 
the  physical  domain  are  also  local  exchanges  between  processors,  any  two  of  which  are  connected 
if  and  only  if  their  binary  representations  differ  in  exactly  one  bit. 

We  now  analyze  the  data  flow  in  each  of  the  major  steps  in  Algorithm  PCG.  We  concentrate 
on  the  boxwise  decomposition,  since  strips  may  be  regarded  as  a  special  case  thereof. 

The  operation  {PCG.7)  of  applying  the  distributed  A  matrix  consists  of:  (1)  sending  the  vector 
of  data  along  each  processor  boundary  not  adjacent  to  a  physical  boundary  to  the  neighboring 
processor:  (2)  computing  the  interior  components  of  the  product  from  the  5-point  star  formula;  (3) 
receiving  the  vector  of  data  coming  across  each  boundary  and  computing  the  boundary  components 
of  the  product.  Note  that  it  is  not  necessary  to  distinguish  between  separator  nodes  and  those 
adjacent  to  a  separator  in  this  step. 

The  operation  (PCG.  11)  of  solving  with  B  consists  of:  (1)  solving  for  in  the  subdomain 
interior  (the  subdomain  interior  includes  the  processor  boundary  nodes  that  are  adjacent  to  a 
separator);  (2)  sending  the  vector  of  data  along  each  north  and  west  processor  boundary  not 
adjacent  to  a  physical  boundary  to  the  neighboring  processor;  (3)  receiving  the  vector  of  data 
coming  across  each  south  and  east  processor  boundary  and  forming  the  right-hand  side  of  the 
edge  separator  system  and  the  local  summands  of  the  right-hand  side  of  the  crosspoint  system;  (4) 
sending  the  appropriate  scalar  summands  to  the  processors  to  the  north  and  west,  if  any;  (5)  solving 
the  south  and  east  edge  systems  for  on  the  separator  set;  (6)  receiving  the  summands  coming 
from  the  processors  to  the  south  and  east,  if  any,  and  forming  the  right-hand  side  of  the  crosspoint 
system;  (7)  scattering  the  local  component  of  the  right-hand  side  of  the  crosspoint  system  to  all 
other  processors,  and  receiving  their  components  in  turn;  (8)  solving  an  identical  global  crosspoint 
system  in  each  processor;  (9)  extending  the  crosspoint  data  linearly  along  the  separator  set  edges 
to  form  ;  (10)  summing  u'’  and  along  the  separator  set  to  form  on  the  separator  set; 
(11)  sending  the  vector  of  data  along  each  south  and  east  boundary  not  adjacent  to  a  physical 
boundary  to  the  neighboring  processor;  (12)  receiving  the  vector  of  data  coming  across  each  north 
and  east  processor  boundary  and  forming  the  right-hand  side  of  the  interior  system;  (13)  solving 
for  in  the  subdomain  interior  and  summing  and  v^.  Steps  (4),  (6)-(7),  and  (9)-(10)  are 
unnecessary  in  the  decoupled  form  of  the  algorithm,  and  step  (8)  is  replaced  by  a  scalar  divide  local 
to  each  processor  containing  a  crosspoint  in  this  case. 

The  operation  of  performing  a  dot  product  in  {PCG.7)  and  (PCG.  12)  consists  of  two  traversals 
of  a  binary  spanning  tree,  arbitrarily  rooted  in  processor  “0”.  First,  the  local  portion  of  the  dot 
product  is  computed.  These  partial  sums  are  then  passed  from  leaf  to  root,  and  summed  along  the 
way.  The  root  processor  performs  the  final  sum  and  sends  the  data  back  from  root  to  leaf. 
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Table  11:  V^u  =  /  in  the  unit  square,  divided  into  equal 
strips.  Results  for  PMM  with  Md  on  the  interfaces,  as  a 
function  of  problem  size  and  number  of  processors. 


Our  present  implementation  of  the  global  scatter  required  in  distributing  the  right-hand  side 
of  the  crosspoint  system,  in  step  (7)  of  (POG.ll),  consists  of  a  different  scatter  rooted  in  each  node 
which  is  identical  to  the  second  half  of  the  dot  product  routine  described  above. 

4.2.  Numerical  results 

We  present  data  obtained  on  an  Intel  iPSC-7  for  the  strip  form  and  the  two  box  forms  of 
the  algorithm  described  above.  For  simplicity  of  coefficient  generation  and  econom>  of  storage 
(as  memorj'  proved  to  be  constraining),  w'e  considered  only  Laplace's  equation  on  a  square  The 
problem  is  the  same  as  that  in  Tables  1,  8,  and  9. 

Our  conventions  for  the  tests  on  the  iPSC  differ  from  those  in  S3  in  that  the  preronditi<>iie.: 
residual  is  used  to  monitor  convergence,  rather  than  the  actual  residual.  This  is  the  practical  rlKore 
since  we  are  not  comparing  different  preconditioners  against  each  other  in  this  section,  but  rather 
carrying  out  larger  scale  tests  on  selected  algorithms.  Since  monitoring  the  true  residual  would 


require  an  extra  global  dot  product  per  iteration,  the  reported  execution  times  would  exhibit  an 
unnatural  penalty  associated  with  increased  subdivision  of  the  domain.  However,  for  completeness, 
we  do  report  the  unpreconditioned  residual  at  termination. 

The  notation  employed  in  the  tables  follows  that  of  section  3  except  that  p  is  now  defined  as 

T  is  the  total  execution  time  (computation  plus  communication), 
including  the  parallelized  computation  of  the  discrete  forcing  term  and  other  initialization  overhead, 
s,  where  applicable,  measures  the  ratio  of  two  execution  times:  T  for  the  next  smaller  number  of 
processors  on  the  same  size  problem  divided  by  the  current  T .  It  is  thus  a  local  measure  of  speedup, 
and  a  natural  one  for  domain  decomposition  algorithms,  but  it  is  not  the  speedup  as  it  is  defined 
in  a  pure  sense  (see  §5). 

We  comment  briefly  on  the  sparsity  of  the  tables.  Our  program  requires  a  minimum  of  8  subin¬ 
tervals  on  a  side  for  each  subdomain.  Therefore,  the  number  of  processors  that  can  be  employed 
on  a  given  problem  is  bounded  above  by  the  resolution  of  the  mesh.  Unfortunately,  the  number  of 
processors  is  also  bounded  below  by  the  limited  local  memory  of  each  node.  The  largest  number  of 
degrees  of  freedom  that  could  be  accommodated  per  subdomain  was  between  2^^  and  2^^;  therefore, 
our  largest  problems,  involving  2^^  or  2*®  nodes,  could  not  be  run  on  small  numbers  of  processors. 
These  two  restrictions  imposed  a  banded  structure  on  the  region  of  h  -  p  parameter  space  in  which 
it  was  feasible  to  run  experiments. 

The  results  of  a  series  of  runs  using  strips  are  given  in  Table  11.  The  p  —  2  column  of  this  table 
is  comparable  to  the  Md  column  of  Table  1;  however,  the  more  lenient  convergence  criterion  leads 
to  one  less  iteration  in  the  more  highly  refined  cases,  and  the  Lanczos  estimate  of  the  condition 
number  is  obviously  sensitive  to  I  for  /  as  small  as  2  or  3  in  this  problem.  Note  the  effect  of 
Theorem  2.2:  as  the  number  of  processors  is  increased  while  the  refinement  is  held  constant,  k 
goes  up  asymptotically  in  proportion  to  the  square  of  the  number  of  processors  (subdomains).  The 
number  of  iterations  required  does  not  behave  as  badly  as  the  condition  number.  From  the  limited 
data,  it  appears  to  go  up  roughly  in  proportion  to  the  number  of  processors,  with  fixed  refinement. 
The  increasing  number  of  iterations  asymptotically  defeats  the  power  of  the  additional  processors 
as  far  as  s  is  concerned,  however. 

The  results  of  a  series  of  runs  using  boxes  without  crosspoint  coupling  are  given  in  Table  12. 
The  p  =  4  column  of  this  table  is  may  be  compared  to  the  Mp  column  of  Table  8;  again  the 
pref  onditioued  residual  convergence  criterion  is  more  lenient,  but  the  condition  number  estimates 
aRtee  almost  exactly.  In  this  table  d/h  is  constant  along  a  diagonal,  therefore  by  Theorem  2.3  we 
\v(  uli:!  exjiert  to  see  a  proportionality  between  k  and  p  along  a  diagonal,  which  is  approximately 

•  j)^.  The  number  of  iterations  required  goes  up  sublinearly  with  the  number  of  processors 

-II b'iomains  1 .  witli  fixed  refinement. 

Tlie  result-  of  a  series  of  runs  using  boxes  with  crosspoint  coupling  are  given  in  Table  13. 
Ti,'  ;  =4  i  tuiimn  of  this  table  may  be  compared  to  the  Mp  column  of  Table  9,  subject  to  the 
,ii  .  \  «  and  the  p  =  IG  column  shows  close  agreement  with  Table  6.2  of  [4].  According 

•  T:.e  r-ti.  2.3  tli»  condition  number  should  be  bounded  by  a  constant  along  a  diagonal,  which 
;  -  .i:-  •  tiie  <-it.se  The  number  of  iterations  virtually  levels  off  and  the  condition  number 

,  •  iirr'ta-ff  a.-  more  processors  are  added  with  fixed  refinement. 

S  Domain  decomposition  on  parallel  computers 

'  •.  I,  ■’mj'hc.sii’e-  the  application  of  domain  decomposition  to  parallel  computation. 

r  ni.  aiistract  problem  with  no  special  features  sucli  as  complicated  domain  geom- 
:  • :  1’  r  n -it i.i r  iiitt  t,  lictate  the  decomposition  and  ask  what  strategy  leads  to  the  most 
i. '•  ii;,;  i-m'-nt  <tti<in  Seveial  interesting  questions  remain,  of  which  the  most  important 
w  :;.;i  a  -peedup  over  a  siiicle  processor  can  be  obtained  in  practice,  and  how  is  the 
■  ■  '  -ii-  parali-i  alcorithm  affected  by  decomposition  topology,  architecture  topology. 
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145.6 

6.5(-l) 
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Table  12:  V^u  =  /  in  the  unit  square,  divided  into  equal 
boxes.  Results  for  PMM  with  A/p  on  the  interfaces  and 
without  vertex  coupling,  as  a  function  of  problem  size  and 
number  of  processors. 


machine  parameters,  and  algorithmic  options.  To  begin  to  answer  these  questions,  we  construct 
theoretical  computational  complexity  models  of  various  approaches  and  compare  them.  One  of  our 
purposes  is  to  determine  what  a  parallel  computer  suitable  for  domain  decomposition  algorithms 
should  look  like.  This  includes  not  only  the  architecture,  but  the  relative  speeds  of  communication 
and  floating  point. 


(a) 

Figure  8:  The  three  parallel  architectures  under  considera¬ 
tion.  (a)  is  a  ring:  the  processors  are  indicated  by  circles. 

(b)  is  a  mesh;  the  processors  are  at  the  intersections  of  the 
lines,  (c)  is  a  4-D  hypercube;  again  the  processors  are  at  the 
intersections. 

We  continue  to  consider  two  basic  types  of  decomposition  topologies;  strips  and  boxes.  We 
consider  one-to-one  subdomain-to-processor  maps  and  estimate  the  costs  for  a  single  iteration. 
For  comparison,  we  consider  the  same  algorithm  implemented  on  a  single  processor  with  the  same 
number  of  subdomains.  This  allows  us  to  see  only  how  effectively  domain  decomposition  algorithms 
can  be  parallelized.  Whether  domain  decomposition  is  the  best  possible  parallel  algorithm  for 
elliptic  systems  is  beyond  the  scope  of  this  paper. 

As  described  in  §§4.1.  we  consider  a  square  domain  uniformly  discretized  with  n  subintervals 
on  a  side  and  we  assume  a  total  of  p  processors  which  evenly  divide  the  domain  into  either  strips 
or  boxes.  We  further  assume  that  the  A  matrix  comes  from  a  5-point  stencil,  that  the  B  matrix 
solves  can  be  done  with  a  fast  Poisson  solver,  and  that  the  edge  solves  can  be  done  with  FFT’s. 

5.1.  Parallel  processor  model  and  subdomain-to-processor  mapping 

In  any  parallel  computer,  interprocessor  communication  must  be  considered.  We  will  assume 
that  the  time  to  transmit  a  message  of  length  n  to  a  “neighbor’’  takes  time  a  +  /3n.  A  “neighbor”  is 
a  processor  which  is  directly  connected.  We  will  further  assume  that  only  sending  messages  takes 
time;  receiving  is  free,  a  is  the  latency  and  0  is  the  transfer  time.  A  floating  point  operation  takes 
time  /. 

We  consider  three  different  parallel  architectures:  a  ring  (cf.  Figure  8(a)),  a  two  dimensional 
mesh  with  wrap-around  (cf.  Figure  8(b),  wrap-around  not  shown),  and  a  hypercube  (cf.  Figure  8(c)), 
which  embeds  the  first  two.  The  mappings  of  the  strips  onto  the  ring  or  the  hypercube  and  the 
boxes  onto  the  mesh  or  the  hypercube  are  the  natural  ones  discussed  in  §§4.1.  Because  of  the  local 
connectivity  of  the  discrete  elliptic  operator,  the  only  real  difference  between  these  architectures 
for  a  given  decomposition  type  is  their  performance  on  the  global  dot  products  in  the  Conjugate 
Gradient  algorithm  in  steps  [PCG.7)  and  (PCG.12).  The  purpose  of  our  estimates  is  to  suggest 
optimal  relative  sizes  of  a,  0.  and  /  for  efficient  use  of  the  processors. 

5.2.  Complexity  estimates 

Using  the  model  described  above,  we  estimate  the  complexity  of  a  computation  using  a  decom¬ 
position  into  either  strips  or  boxes  on  the  three  architectures  under  consideration.  The  times  are 
shown  in  Table  14  on  a  per  subdomain  per  iteration  basis.  .\.ll  of  the  times  shown  in  the  “Comm. 
Cost"  column  represent  transmissions  of  edge  data  from  one  processor  to  a  neighbor.  Note  that 
since  we  assume  that  nearest  neighbor  communications  are  done  one  at  a  time,  there  is  no  difference 
between  the  hypercube,  the  mesh,  and  the  ring  for  sends  to  a  neighbor  (at  most,  there  is  a  constant 
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Operation  type 
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S-solve 

interior 

0 

Cjn‘^p~^  log(np“') 

2 

edge 

0 

CEn  log  n 

1 

Single  Processor 

Dot  Product 

0 

2 

{p  domains) 

>?-multiply 

0 

5r!2 

1 

Boxes 

S-solve 

interior 

0 

Cjn‘^p~^  log(np~5) 

2 

edge 

0 

CEnp~'i  log(np“2) 

2 

corner 

0 

1 

1 

Parallel  Processor 

Dot  Product 

see  Table  15 
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1 

T&ble  14:  Times  (in  units  of  /)  for  communication  and  com¬ 
putation  for  the  partitioned  matrix  method  on  two  decompo¬ 
sition  topologies  and  various  forms  of  parallel  architectures 
(a*  =  q//-  3'  =  0//)-  The  only  differences  between  pro¬ 
cessors  are  in  the  times  for  dot-products,  which  are  shown 
in  Table  15.  Also  included  are  times  for  a  single  processor 
implementation  of  the  domain  decomposition  algorithm. 


Architecture 

Dot  product  time 

Ring 

2-D  Mesh 
Hypercube 

p(a  -1-  0) 
2v/p(Q'+  /?) 
21ogp(a-f  0) 

Table  15:  Communications  times  for  doing  dot  products. 
These  assume  that  a  single  node  collects  the  data,  performs 
the  dot  product,  and  distributes  the  result. 


factor  of  four  difference  here  between  the  architectures  if  simultaneous  sends  are  allowed).  It  is  in 
the  global  dot  product  that  the  three  architectures  differ  most:  these  times  are  shown  in  Table  15. 
All  of  these  times  are  for  simple  dot-product  algorithms,  and  we  make  no  claim  that  they  are  opti¬ 
mal.  However,  doing  much  better  is  hard;  for  example,  see  [27]  for  a  sophisticated  method  for  doing 
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NA 
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"  T^XTp 

NA 

Table  16:  Value  of  a/l3  =  r  where  local  and  global  commu¬ 
nication  costs  are  the  same. 


the  global  dot  product  which  is  only  a  little  faster  than  the  ‘‘obvious”  algorithm  we  have  used  here. 
The  times  shown  in  the  “Comp.  Cost”  column  represent  the  leading  order  operation  counts  (in 
scalar  multiply-adds)  of  the  various  “Operation  types”  only,  and  are  accurate  for  asymptotically 
large  n.  When  p  is  simultaneously  large,  lower-order  terms  in  n  should  arguably  be  retained,  but 
we  have  not  made  the  attempt  here.  Thus  the  apparent  reduction  in  serial  complexity  in  going 
from  a  global-domain  ftist  Poisson  solves  (cost  C/n^  log  u)  to  a  set  of  p  subdomain  fast  Poisson 
solve  (cost  p  C/n(np~*)  log(np~^))  should  not  be  regarded  as  an  advantage  that  can  be  exploited 
by  taking  p  large.  The  0(n^}  terms  of  the  fast  Poisson  solver  become  important  in  this  limit. 

While  it  is  tempting  to  discard  all  of  the  communication  terms  because  they  are  asymptotically 
in  n  unimportant,  it  turns  out  that  because  n  and  p  may  be  roughly  the  same  size  and  that  q  and/or 
3  may  be  much  larger  than  /,  this  can  be  very  misleading.  In  the  rest  of  this  section,  we  will  look 
at  where  in  the  parameter  space  of  n.p.  a, and  /  local  or  global  communication  is  as  expensive  as 
computation.  We  will  do  this  by  first  determining  where  local  or  global  communication  dominates 
the  communication  cost,  then  looking  at  where  either  local  or  global  communication  dominates 
computation.  Finally,  we  will  estimate  the  performance  of  several  possible  computer  architectures 
by  picking  representative  values  of  a.  and  /. 

The  local  communication  consists  only  of  edge  exchanges,  and  takes  time  4(q  -f  /3n)  for  strips 
and  time  8(q  -|-  0n/^)  for  boxes.  The  global  communication  consists  only  of  dot  products,  and 
is  of  the  form  C£)(q  +  B),  where  Cd  —  2p  for  a  ring,  Co  =  4^  for  a  mesh,  and  Cd  =  4logp  for  a 
hypercube. 

We  can  determine  the  ratio  r  =  a/B  where  these  two  costs  are  equal  as  a  function  of  n  and 
p.  The  formulas  are  shown  in  Table  16.  Contour  plots  of  the  four  cases  presented  in  Table  16  are 
shown  in  Figure  9.  These  plots  clearly  show  that  a  hypercube  can  afford  a  higher  ratio  of  a/B  than 
either  a  ring  or  a  mesh,  but  that  (particularly  for  boxes)  the  advantage  is  not  very  great  for  typical 
ranges  of  p  and  n. 

Also  of  interest  is  where  the  ratios  a/ f  and  B/ f  are  such  that  the  communication  time  is 
equal  to  the  computation  time.  We  can  look  at  this  by  considering  just  the  dominant  terms  in  the 
communication  and  the  computation. 

The  dominant  computation  term  for  strips  is  2Cin^p~^  log(np“*)  and  the  dominant  commu¬ 
nication  term  is  either  the  local  term  4(q  -I-  Bn)  or  the  global  term  CD{a  -I-  B)-  The  dominant 
computation  term  for  boxes  is  2C[n^p~^  log(np~5),  and  the  dominant  communication  term  is  ei¬ 
ther  the  local  term  8(q-I-  Bnp~^)  or  the  global  term  Co(Q'-t-  B)-  Table  17  summarizes  these  results 
and  Figure  10  gives  contour  plots  of  these  values.  To  generate  these  and  subsequent  results,  we 
have  for  convenience  set  Cj  =  Ce  =  5.  which  are  representative  of  cyclic  reduction  for  separable 
symmetric  discrete  elliptic  systems  and  a  pair  of  sine  transforms,  respectively. 

Because  of  the  complexities  of  these  formulas,  we  have  constructed  graphs  of  the  speedup, 
fraction  spent  in  communication,  and  efficiency  for  several  hardware  implementations  (values  of  q. 
3.  and  /)  for  each  of  these  architectures.  We  define  the  speedup  as  the  ratio  of  the  time  to  do  a  single 
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Figure  9:  Contour  plots  of  r  =  ratio  of  a//?  where  global 
communication  takes  the  same  time  as  local  communication. 
If  for  a  given  (p.  n).  a/0  is  larger  than  the  value  on  the  plot, 
then  global  communication  is  more  important;  otherwise,  lo¬ 
cal  communication  is  dominant.  Alternatively,  for  a  given 
r,  local  communication  dominates  above  the  contour,  and 
global  communication  below. 
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how  well  a  parallel  computer  is  being  used  and  gives  an  indication  of  how  much  value  there  is  in 
increasing  the  number  of  processors. 

In  order  to  provide  some  comparisons,  we  will  consider  all  four  combinations  of  parallel  archi¬ 
tecture  and  domain  decomposition  with  the  same  parameters.  This  is  somewhat  unfair  to  the  ring 
and  to  the  mesh,  since  their  simpler  communication  structure  would  allow  more  engineering  effort 
to  be  spent  on  faster  communication  or  computation.  The  values  we  have  chosen  are  meant  to  be 
fairly  representative  of  what  exists  or  could  exist  in  the  near  future. 

These  estimates  of  which  part  of  the  algorithm  dominate  performance  suggest  a  few  sample 
choices  of  the  parameters  q,  3.  and  /.  The  most  favorable  is  a  machine  which  is  not  communication 
dominated.  Such  a  processor  might  realistically  be  constructed  to  have: 


Q  =  100/i  seconds 
3  —  seconds 
/  =  10/y  seconds 


This  makes  each  node  a  100  Kflop  processor,  with  200  Kiloword/second  transfer  speed,  and  an 
I/O  startup  time  of  10  floating  point  operations.  This  machine  has  a  relatively  well  balanced 
communication  and  computation  unit,  and  probably  represents  a  good  compromise.  Of  course,  in 
all  cases  it  is  the  dimensionless  ratios  of  communication  time  to  computation  time,  or  a//  and  3! ! 
which  matter.  The  speed  of  identical  ensemble  architectures  with  the  same  ratios  will  scale  linearly 
with  the  basic  floating  point  speed.  For  the  balanced  processor,  these  values  are  q/ /  =  10  and 
3! J  —  0.5.  Figures  11-12  show  various  results  based  on  the  complexity  estimates  of  Table  1  for 
this  machine,  as  functions  of  n  and  p,  where  n  ranges  from  2^  to  2’°  and  p  from  1  to  2®.  The  data  is 
set  to  zero  over  the  regions  in  which  p  >  n  Note  that,  for  these  parameters,  there  is  little  difference 
between  a  ring  or  a  mesh  and  a  hypercube  (cf.  results  in  Tables  16  and  17).  We  emphasize  that 
these  results  are  all  theoretical.  Note  that  even  with  this  balanced  machine,  communication  is  not 
unimportant  if  decomposition  into  strips  is  used,  because  of  the  long  vectors  of  interface  data  which 
must  be  moved  between  processors.  For  boxes,  w'hose  area-to-perimeter  ratio  is  better  for  large 
p,  the  communication  cost  rapidly  vanishes  for  large  problems  (which  can  effectively  put  to  use 
large  numbers  of  processors).  An  effect  which  we  do  not  take  into  account  here  is  the  potential  for 
vectorization  within  a  subdomain.  Because  of  the  longer  vector  lengths,  stripwise  decompositions 
have  a  tendency  to  be  favorable  from  the  computational  standpoint  if  tasks  local  to  a  processor 
vectorize  well. 

The  first  two  figures  show  the  speedup  in  surface  plot  and  contour  plot  form,  the  next  the 
fraction  of  the  time  taken  in  communication,  and  the  last  shows  the  efficiency.  A  few  remarks  are 
in  order  about  these  graphs.  Recall  that  all  are  on  a  per  iteration  basis. 

A  meaningful  speedup  is  difficult  to  define.  One  measure  would  be  to  compute  the  ratio  of 
the  time  for  the  best  single  processor  algorithm  to  the  time  for  a  parallel  domain  decomposition 
algorithm.  Unfortunately,  there  is  no  consensus  for  the  best  single  processor  algorithm.  Instead,  we 
consider  the  ratio  of  the  time  for  a  single  processor  implementation  of  the  domain  decomposition 
algorithm,  with  the  same  number  of  domains  as  the  parallel  implementation.  This  speedup  measures 
the  opposing  effects  of  less  work  per  processor  versus  more  communication. 

The  plot  of  the  fraction  of  the  time  in  communication  shows  quite  graphically  the  overhead 
due  to  the  distribution  of  the  algorithm  across  many  processors. 

The  plot  of  the  efficiency  provides  a  complementary  visualization  of  this  distribution  overhead, 
which  is  best  amortized  over  large  subdomains  from  the  point  of  view  of  the  per  iteration  cost  factor. 


Figure  11:  Theoretical  parallelization  measures  for  a  hyper¬ 
cube,  using  strips  for  the  domain  decomposition.  The  figures 
for  a  ring  using  strips  are  virtually  identical.  Balanced  pro- 


Figure  12:  Theoretical  parallelization  measures  for  a  hyper¬ 
cube,  using  boxes  for  the  domain  decomposition.  The  figures 
for  a  mesh  using  boxes  are  virtually  identical.  Balanced  Pro- 
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Another  choice  for  the  parameters  would  represent  a  faster  processor,  purchased  at  the  expense 
of  communication  startups.  Such  a  processor  might  have  parameters: 

a  =  2500/i  seconds 
0  =  2.5/i  seconds 
f  =  Sfj  seconds 

Thus,  a/f  =  500,  0f f  =  0.5,  and  a/0  =  1000.  Note  that  in  such  a  machine,  global  communication 
will  dominate  local  communication  (cf.  Figure  9)  and  it  will  often  be  communication  dominated 
(cf.  Figure  10),  particularly  for  strips.  The  results  for  this  processor  are  shown  in  Figures  13-16. 

For  simplicity,  all  of  the  plotted  results  for  decompositions  of  box  type  are  based  on  the 
communication-free  form  of  the  crosspoint  solves  which  are  needed  as  part  of  applying  S”'.  As 
discussed  theoretically  in  §2  and  confirmed  experimentally  in  §§4.2.  this  preconditioning  allows  the 
overall  iteration  count  to  grow  as  p  increases.  Therefore,  we  consider  the  more  implicit  algorithm 
due  to  Bramble  et  al  [4].  This  method  requires  that  a  linear  system  of  size  p  with  bandwidth  ^ 
involving  values  at  all  the  corner  points  be  solved  at  each  iteration.  The  computational  cost  of 
assembling  and  solving  this  extra  system  of  equations  is  0(n/ ^Jp)  -|-  O(p^).  This  is  negligible  for 
practical  ranges  of  p  and  n  in  comparison  to  the  computational  work  required  for  the  interior  solves. 
However,  the  communication  required  to  set  up  the  right-hand  side  demands  a  close  look.  Each 
processor  must  first  exchange  a  scalar  with  each  of  the  two  neighbors  whose  interfaces  emanate 
from  its  crosspoint  to  form  its  component  of  the  right-hand  side.  Then  all  of  the  data  associated 
with  each  crosspoint  must  be  scattered  to  every  processor,  each  of  which  solves  an  identical  system. 
Because  of  the  small  size  of  the  system,  this  is  faster  than  distributing  the  matrix  across  all  the 
processors,  unless  communication  is  extremely  inexpensive.  In  Table  18  we  show  the  additional 
costs  involved  in  handling  the  corner  points  based  on  the  simplest  multiple  scatter  algorithm  from 
[27j.  The  costs  assigned  are  not  necessarily  optimal,  but  they  are  representative. 

As  can  be  seen  from  comparison  with  Tables  15  and  14,  these  local  and  global  costs  are 
comparable  to  or  less  than  the  costs  for  the  local  and  global  communication  already  present  in  the 
algorithm,  provided  that  a/0  is  not  less  than  0(y/p)  for  the  mesh  and  not  less  than  O(p/logp) 
for  the  cube.  The  region  of  parameters  we  have  investigated  in  this  section  lies  up  against  these 
respective  boundaries,  but  does  not  violate  them  severely.  Therefore,  solving  the  crosspoint  system 
may  roughly  double  the  communication  time  when  p  is  at  the  upper  end  of  our  range.  This  will  be 
significant  if  communication  is  already  expensive  but  it  will  not  bring  down  the  plateaus  of  high 
efficienc}’  in  Figures  12,  14  and  16.  Thus,  solving  the  corner  point  problem  will  not  be  a  significant 
additional  cost  on  machines  already  operating  in  favorable  regions  of  parameter  space. 

6.  Conclusions 

Our  conclusions  fall  into  two  categories:  those  pertaining  to  domain  decomposition  algorithms 
themselves,  and  those  pertaining  to  domain  decomposition  as  an  example  of  a  general  parallel 
computation. 

6.1.  Domain  Decomposition  Algorithms 

Two  basic  forms  of  the  algorithm  (SCM  and  PMM).  different  decomposition  topologies,  and 
many  preconditioners  for  the  separator  set  equations,  each  with  advantages  and  disadvantages, 
make  the  study  of  domain  decomposition  surprisingly  rich,  since  it  is  rooted  in  rather  classical  and 
straightforward  ideas. 

Because  of  its  lower  computational  complexity,  we  prefer  the  SCM  when  it  is  applicable,  namely 
when  the  subdomain  solves  can  be  handled  exactly.  In  this  case  the  full  PMM  iterates  reduce  to 
the  SCM  iterates  and  the  latter  can  be  obtained  more  cheaply.  The  PMM  is  a  more  flexible 


Figure  IJ:  Theoretical  parallelization  measures  for  a  ring 
using  strips  for  the  domain  decomposition.  Communication 
dominated  processor. 


Figure  14:  Theoretical  parallelization  measures  for  a  mesh 
using  boxes  for  the  domain  decomposition.  Communication- 
dominated  processor. 
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Comm.  Type 

Crosspoint  Time 

Assembly,  either  arch. 
Scatter,  2-D  Mesh 
Scatter,  Hypercube 

2(a  +  0) 

2api  +  0[pi  +  p) 

2q  logp  -f  0p 

Table  18:  Communication  times  for  the  implicit  handling  of 
the  crosspoint  system. 


algorithm  for  general  symmetric  scalar  problems,  but  many  generalizations  are  needed,  especially 
to  nonsymmetric  systems  of  equations,  where  some  generalized  form  of  preconditioned  conjugate 
gradients,  such  as  the  generalized  minimum  residual  method,  will  come  into  play,  and  where  the 
optimal  preconditioners  for  the  symmetric  positive  definite  version  of  the  algorithm  will  have  less 
value. 

The  optimal  preconditioners  are  to  be  prefered  to  the  MSC  preconditioners  over  the  full  range 
of  examples  contained  herein,  but  the  latter  offer  more  possibilities  for  generalization  and  show  at 
least  some  promise  in  dealing  with  coefficient  irregularity. 

In  terms  of  convergence  properties,  boxwise  decompositions  have  been  demonstrated  preferable 
to  stripwise  decompositions  for  large  numbers  of  processors,  and  the  crosspoint-coupled  version  of 
the  algorithm  is  preferable  from  the  point  of  view  of  iteration  count. 

Our  parallelization  analysis  has  revealed  the  important  practical  result  that  for  reasonably 
achievable  machine  parameters  and  reasonably  large  problems,  boxwise  decompositions  can  be 
made  more  efficient  than  stripwise  decompositions  per  iteration,  in  addition  to  offering  better 
convergence  properties;  and  that  when  domain  decomposition  is  efficiently  parallelizable  at  all.  the 
crossoint-coupled  version  of  the  algorithm  is  not  much  less  efficient  than  the  decoupled  version. 
This  argues  for  further  research  emphasis  in  the  direction  of  more  implicit  preconditioners. 


6.2.  Domain  Decomposition  as  an  example  of  parallel  computation 

The  purpose  of  studying  parallel  computation  is  to  not  merely  to  invent  and  analyze  new 
algorithms,  but  also  to  make  suggestions  about  what  types  of  parallel  computers  will  be  effective 
for  scientific  computing.  Domain  decomposition  provides  one  set  of  non-trivial  algorithms  which 
test  many  features  of  parallel  computers.  An  analysis  of  domain  decomposition  as  a  model  of 
a  general  parallel  computation  reveals  some  interesting  information  about  distributed  memory 
parallel  computer  design.  First,  the  time  to  start  an  I/O  operation  must  not  be  too  much  larger 
than  the  time  to  actually  send  a  single  word.  For  the  range  of  problems  considered  theoretically, 
the  startup  time  a  should  not  be  more  than  an  order  of  magnitude  larger  than  the  transfer  time  0. 
As  shown  in  Figure  9,  if  the  startup  time  is  much  larger  than  this,  then  global  communication  will 
dominate  the  communication  and  a  different  communication  structure  for  the  parallel  computer 
may  be  more  cost  effective. 

The  “best”  ratio  of  communication  speed  to  computation  speed  is  very  dependent  on  the 
problem  size  and  the  decomposition  granularity.  The  arithmetic  complexity  of  elliptic  PDEs  is 
such  that  for  moderate  to  coarse  granularity  communication  will  not  dominate  computation  as 
long  as  communication  speeds  are  comparable  to  the  floating  point  speed.  However,  either  a  large 
startup  time  or  a  slow  transfer  rate  will  significantly  degrade  performance  (cf.  Figures  13-16). 
Figure  10  shows  that  both  a  and  0  may  be  an  order  of  magnitude  or  so  slower  than  /  without 
making  communication  a  dominant  part  of  the  cost.  This  result,  with  the  results  on  a/0,  argues 
for  parallel  processors  with  fast  nodes  and  communication  which  may  be  a  little  slower  than  the 
computation.  Therefore,  engineering  effort  should  not  be  spent  making  communication  as  fast  as 
computation  unless  the  cost  of  doing  so  is  low. 


In  terms  of  architectures,  our  results  argue  in  favor  of  hypercubes  for  domain  decomposition. 
However,  the  advantage  of  the  hypercube  over  the  mesh  for  the  boxwise  decomposition  can  be 
slight  in  the  region  of  parameter  space  where  high  efficiency  is  found,  so  if  significant  engineering 
improvements  can  be  obtained  at  the  expense  of  the  richer  interconnect  of  the  hypercube,  such  a 
compromise  is  worth  considering. 

Our  frustrating  experiences  in  trying  to  make  a  relatively  small  program  and  its  data  area  fit 
on  a  hypercube  for  problems  of  practical  dimensions  leads  to  a  final  comment  concerning  large- 
scale  parallelism  in  scientific  computing.  In  an  MIMD  computer,  large  amounts  of  memory  are 
devoted  to  storing  multiple  images  of  the  user’s  program.  For  example,  in  our  tests,  the  program 
itself  occupied  about  200  Kbytes,  not  counting  data  areas.  Since  our  test  problem  was  idealized 
in  many  respects,  it  is  easy  to  see  that  a  genuine  production  program  would  need  substantially 
more  memory.  In  fact,  4  Megabytes  per  node  might  not  be  adequate.  A  256  processor  system  with 
this  much  memory  per  node  would  have  a  total  of  1  Gigabyte  of  memory,  much  of  which  would  be 
used  to  store  copies  of  the  user’s  program  and  the  operating  system.  This  suggests  that  large  scale 
parallel  computers  either  follow  a  SIMD  model  or  provide  some  way  to  share,  perhaps  in  clusters, 
memory  for  the  user’s  program  and  operating  system. 
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